Preface


This is a book about Monte Carlo methods from the perspective of financial 
engineering. Monte Carlo simulation has become an essential tool in the pric- 
ing of derivative securities and in risk management; these applications have, 
in turn, stimulated research into new Monte Carlo techniques and renewed 
interest in some old techniques. This is also a book about financial engineer- 
ing from the perspective of Monte Carlo methods. One of the best ways to 
develop an understanding of a model of, say, the term structure of interest 
rates is to implement a simulation of the model; and finding ways to improve 
the efficiency of a simulation motivates a deeper investigation into properties 
of a model. 

My intended audience is a mix of graduate students in financial engi- 
neering, researchers interested in the application of Monte Carlo methods in 
finance, and practitioners implementing models in industry. This book has 
grown out of lecture notes I have used over several years at Columbia, for 
a semester at Princeton, and for a short course at Aarhus University. These 
classes have been attended by masters and doctoral students in engineering, 
the mathematical and physical sciences, and finance. The selection of topics 
has also been influenced by my experiences in developing and delivering pro- 
fessional training courses with Mark Broadie, often in collaboration with Leif 
Andersen and Phelim Boyle. The opportunity to discuss the use of Monte 
Carlo methods in the derivatives industry with practitioners and colleagues 
has helped shape my thinking about the methods and their application.

Students and practitioners come to the area of financial engineering from 
diverse academic fields and with widely ranging levels of training in mathe- 
matics, statistics, finance, and computing. This presents a challenge in set- 
ting the appropriate level for discourse. The most important prerequisite for 
reading this book is familiarity with the mathematical tools routinely used 
to specify and analyze continuous-time models in finance. Prior exposure to 
the basic principles of option pricing is useful but less essential. The tools 
of mathematical finance include Itô calculus, stochastic differential equations,
and martingales. Perhaps the most advanced idea used in many places in
this book is the concept of a change of measure. This idea is so central both
to derivatives pricing and to Monte Carlo methods that there is simply no
avoiding it. The prerequisites to understanding the statement of the Girsanov
theorem should suffice for reading this book.

Whereas the language of mathematical finance is essential to our topic, its 
technical subtleties are less so for purposes of computational work. My use of
mathematical tools is often informal: I may assume that a local martingale
is a martingale or that a stochastic differential equation has a solution, for
example, without calling attention to these assumptions. Where convenient,
I take derivatives without first assuming differentiability and I take
expectations without verifying integrability. My intent is to focus on the issues most
important to Monte Carlo methods and to avoid diverting the discussion to
spell out technical conditions. Where these conditions are not evident and
where they are essential to understanding the scope of a technique, I discuss
them explicitly. In addition, an appendix gives precise statements of the most
important tools from stochastic calculus.

This book divides roughly into three parts. The first part, Chapters 1-3, 
develops fundamentals of Monte Carlo methods. Chapter 1 summarizes the
theoretical foundations of derivatives pricing and Monte Carlo. It explains
the principles by which a pricing problem can be formulated as an integration
problem to which Monte Carlo is then applicable. Chapter 2 discusses
random number generation and methods for sampling from nonuniform
distributions, tools fundamental to every application of Monte Carlo. Chapter 3
provides an overview of some of the most important models used in financial
engineering and discusses their implementation by simulation. I have included
more discussion of the models in Chapter 3 and the financial underpinnings
in Chapter 1 than is strictly necessary to run a simulation. Students often
come to a course in Monte Carlo with limited exposure to this material, and
the implementation of a simulation becomes more meaningful if accompanied
by an understanding of a model and its context. Moreover, it is precisely in
model details that many of the most interesting simulation issues arise.

If the first three chapters deal with running a simulation, the next three 
deal with ways of running it better. Chapter 4 presents methods for increasing
precision by reducing the variance of Monte Carlo estimates. Chapter 5
discusses the application of deterministic quasi-Monte Carlo methods for
numerical integration. Chapter 6 addresses the problem of discretization error
that results from simulating discrete-time approximations to continuous-time
models.

The last three chapters address topics specific to the application of Monte 
Carlo methods in finance. Chapter 7 covers methods for estimating price
sensitivities or “Greeks.” Chapter 8 deals with the pricing of American options,
which entails solving an optimal stopping problem within a simulation.
Chapter 9 is an introduction to the use of Monte Carlo methods in risk management.
It discusses the measurement of market risk and credit risk in financial
portfolios. The models and methods of this final chapter are rather different from
those in the other chapters, which deal primarily with the pricing of derivative
securities. 

Several people have influenced this book in various ways and it is my 
pleasure to express my thanks to them here. I owe a particular debt to my 
frequent collaborators and co-authors Mark Broadie, Phil Heidelberger, and 
Perwez Shahabuddin. Working with them has influenced my thinking as well 
as the book’s contents. With Mark Broadie I have had several occasions to 
collaborate on teaching as well as research, and I have benefited from our many 
discussions on most of the topics in this book. Mark, Phil Heidelberger, Steve 
Kou, Pierre L’Ecuyer, Barry Nelson, Art Owen, Philip Protter, and Jeremy 
Staum each commented on one or more draft chapters; I thank them for 
their comments and apologize for the many good suggestions I was unable to 
incorporate fully. I have also benefited from working with current and former 
Columbia students Jingyi Li, Nicolas Merener, Jeremy Staum, Hui Wang, Bin 
Yu, and Xiaoliang Zhao on some of the topics in this book. Several classes 
of students helped uncover errors in the lecture notes from which this book
evolved.


Paul Glasserman 
New York, 2003
Foundations


This chapter’s two parts develop key ideas from two fields, the intersection of 
which is the topic of this book. Section 1.1 develops principles underlying the 
use and analysis of Monte Carlo methods. It begins with a general descrip- 
tion and simple examples of Monte Carlo, and then develops a framework for 
measuring the efficiency of Monte Carlo estimators. Section 1.2 reviews con- 
cepts from the theory of derivatives pricing, including pricing by replication, 
the absence of arbitrage, risk-neutral probabilities, and market completeness. 
The most important idea for our purposes is the representation of derivative 
prices as expectations, because this representation underlies the application 
of Monte Carlo. 


1.1 Principles of Monte Carlo 


1.1.1 Introduction 


Monte Carlo methods are based on the analogy between probability and vol- 
ume. The mathematics of measure formalizes the intuitive notion of probabil- 
ity, associating an event with a set of outcomes and defining the probability of 
the event to be its volume or measure relative to that of a universe of possible 
outcomes. Monte Carlo uses this identity in reverse, calculating the volume 
of a set by interpreting the volume as a probability. In the simplest case, this 
means sampling randomly from a universe of possible outcomes and taking 
the fraction of random draws that fall in a given set as an estimate of the set’s 
volume. The law of large numbers ensures that this estimate converges to the 
correct value as the number of draws increases. The central limit theorem 
provides information about the likely magnitude of the error in the estimate 
after a finite number of draws. 

A small step takes us from volumes to integrals. Consider, for example, 
the problem of estimating the integral of a function f over the unit interval. 


We may represent the integral

$$ \alpha = \int_0^1 f(x)\,dx $$

as an expectation $E[f(U)]$, with U uniformly distributed between 0 and 1.
Suppose we have a mechanism for drawing points $U_1, U_2, \ldots$ independently
and uniformly from [0,1]. Evaluating the function f at n of these random
points and averaging the results produces the Monte Carlo estimate

$$ \hat{\alpha}_n = \frac{1}{n} \sum_{i=1}^{n} f(U_i). $$

If f is indeed integrable over [0,1] then, by the strong law of large numbers,

$$ \hat{\alpha}_n \to \alpha \quad \text{with probability 1 as } n \to \infty. $$


If f is in fact square integrable and we set

$$ \sigma_f^2 = \int_0^1 (f(x) - \alpha)^2\,dx, $$

then the error $\hat{\alpha}_n - \alpha$ in the Monte Carlo estimate is approximately normally
distributed with mean 0 and standard deviation $\sigma_f/\sqrt{n}$, the quality of this
approximation improving with increasing n. The parameter $\sigma_f$ would typically
be unknown in a setting in which $\alpha$ is unknown, but it can be estimated using
the sample standard deviation

$$ s_f = \sqrt{ \frac{1}{n-1} \sum_{i=1}^{n} \left( f(U_i) - \hat{\alpha}_n \right)^2 }. $$

Thus, from the function values $f(U_1), \ldots, f(U_n)$ we obtain not only an
estimate of the integral $\alpha$ but also a measure of the error in this estimate.
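
The estimate and its error measure can be computed together in a few lines. The following is a minimal sketch, assuming the integrand is the exponential function (chosen here only because its integral over [0,1] is known to be e - 1); the function name `mc_integrate` and the seed are illustrative choices, not part of the text.

```python
import math
import random

def mc_integrate(f, n, rng):
    """Estimate the integral of f over [0, 1] from n uniform draws.

    Returns the sample mean and the estimated standard error s_f / sqrt(n).
    """
    values = [f(rng.random()) for _ in range(n)]
    mean = sum(values) / n
    # Sample variance with the n - 1 divisor.
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(var / n)

rng = random.Random(12345)            # arbitrary fixed seed for reproducibility
estimate, std_err = mc_integrate(math.exp, 100_000, rng)
exact = math.e - 1.0                  # integral of e^x over [0, 1]
print(f"estimate = {estimate:.5f}, std error = {std_err:.5f}, exact = {exact:.5f}")
```

The reported standard error should be comparable in size to the actual error, which is the practical content of the central limit theorem statement above.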

The form of the standard error $\sigma_f/\sqrt{n}$ is a central feature of the Monte
Carlo method. Cutting this error in half requires increasing the number of
points by a factor of four; adding one decimal place of precision requires
100 times as many points. These are tangible expressions of the square-root
convergence rate implied by the $\sqrt{n}$ in the denominator of the standard error.
In contrast, the error in the simple trapezoidal rule

$$ \alpha \approx \frac{f(0) + f(1)}{2n} + \frac{1}{n} \sum_{i=1}^{n-1} f(i/n) $$

is $O(n^{-2})$, at least for twice continuously differentiable f. Monte Carlo is
generally not a competitive method for calculating one-dimensional integrals.
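
The quadratic convergence of the trapezoidal rule is easy to observe numerically. This sketch again assumes the exponential integrand as a test case (an illustrative choice): doubling n should cut the error by roughly a factor of four.

```python
import math

def trapezoid(f, n):
    """Trapezoidal rule on [0, 1] with n equal subintervals."""
    interior = sum(f(i / n) for i in range(1, n))
    return (f(0.0) + f(1.0)) / (2 * n) + interior / n

exact = math.e - 1.0                  # integral of e^x over [0, 1]
err_50 = abs(trapezoid(math.exp, 50) - exact)
err_100 = abs(trapezoid(math.exp, 100) - exact)
print(f"error at n=50: {err_50:.2e}, at n=100: {err_100:.2e}, ratio: {err_50 / err_100:.2f}")
```

By comparison, halving a Monte Carlo standard error requires four times as many points.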

The value of Monte Carlo as a computational tool lies in the fact that its
$O(n^{-1/2})$ convergence rate is not restricted to integrals over the unit interval.
Indeed, the steps outlined above extend to estimating an integral over $[0,1]^d$
(and even $\mathbb{R}^d$) for all dimensions d. Of course, when we change dimensions we
change f and when we change f we change $\sigma_f$, but the standard error will still
have the form $\sigma_f/\sqrt{n}$ for a Monte Carlo estimate based on n draws from the
domain $[0,1]^d$. In particular, the $O(n^{-1/2})$ convergence rate holds for all d. In
contrast, the error in a product trapezoidal rule in d dimensions is $O(n^{-2/d})$ for
twice continuously differentiable integrands; this degradation in convergence
rate with increasing dimension is characteristic of all deterministic integration
methods. Thus, Monte Carlo methods are attractive in evaluating integrals in
high dimensions.
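
To illustrate that the square-root rate does not depend on dimension, the sketch below estimates an integral over the 10-dimensional unit cube. The integrand exp(-|x|^2) is an assumption made for illustration; it factors across coordinates, so the exact value is available through the error function for comparison.

```python
import math
import random

def mc_integrate_cube(f, d, n, rng):
    """Monte Carlo estimate of the integral of f over [0, 1]^d, with its standard error."""
    values = [f([rng.random() for _ in range(d)]) for _ in range(n)]
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(var / n)

def integrand(x):
    """Illustrative integrand exp(-(x_1^2 + ... + x_d^2))."""
    return math.exp(-sum(xi * xi for xi in x))

d = 10
# The integral factors: (integral of exp(-x^2) over [0, 1]) raised to the power d.
exact = (0.5 * math.sqrt(math.pi) * math.erf(1.0)) ** d
rng = random.Random(2003)             # arbitrary fixed seed
est_small, se_small = mc_integrate_cube(integrand, d, 10_000, rng)
est_large, se_large = mc_integrate_cube(integrand, d, 40_000, rng)
print(f"n=10000: {est_small:.5f} +/- {se_small:.5f}")
print(f"n=40000: {est_large:.5f} +/- {se_large:.5f} (exact {exact:.5f})")
```

Quadrupling n from 10,000 to 40,000 should roughly halve the standard error, just as in one dimension.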

What does this have to do with financial engineering? A fundamental im- 
plication of asset pricing theory is that under certain circumstances (reviewed 
in Section 1.2.1), the price of a derivative security can be usefully represented 
as an expected value. Valuing derivatives thus reduces to computing expecta- 
tions. In many cases, if we were to write the relevant expectation as an integral, 
we would find that its dimension is large or even infinite. This is precisely the 
sort of setting in which Monte Carlo methods become attractive. 

Valuing a derivative security by Monte Carlo typically involves simulating 
paths of stochastic processes used to describe the evolution of underlying 
asset prices, interest rates, model parameters, and other factors relevant to 
the security in question. Rather than simply drawing points randomly from 
[0,1] or $[0,1]^d$, we seek to sample from a space of paths. Depending on how
the problem and model are formulated, the dimension of the relevant space 
may be large or even infinite. The dimension will ordinarily be at least as large 
as the number of time steps in the simulation, and this could easily be large 
enough to make the square-root convergence rate for Monte Carlo competitive 
with alternative methods. 

For the most part, there is nothing we can do to overcome the rather slow 
rate of convergence characteristic of Monte Carlo. (The quasi-Monte Carlo 
methods discussed in Chapter 5 are an exception — under appropriate con- 
ditions they provide a faster convergence rate.) We can, however, look for 
superior sampling methods that reduce the implicit constant in the conver- 
gence rate. Much of this book is devoted to examples and general principles 
for doing this. 

The rest of this section further develops some essential ideas underly- 
ing Monte Carlo methods and their application to financial engineering. Sec- 
tion 1.1.2 illustrates the use of Monte Carlo with two simple types of option 
contracts. Section 1.1.3 develops a framework for evaluating the efficiency of
simulation estimators.


1.1.2 First Examples 


In discussing general principles of Monte Carlo, it is useful to have some simple 
specific examples to which to refer. As a first illustration of a Monte Carlo 
method, we consider the calculation of the expected present value of the payoff
of a call option on a stock. We do not yet refer to this as the option price; the
connection between a price and an expected discounted payoff is developed in 
Section 1.2.1. 

Let S(t) denote the price of the stock at time t. Consider a call option 
granting the holder the right to buy the stock at a fixed price K at a fixed 
time T in the future; the current time is t = 0. If at time T the stock price
S(T) exceeds the strike price K, the holder exercises the option for a profit
of $S(T) - K$; if, on the other hand, $S(T) \le K$, the option expires worthless.
(This is a European option, meaning that it can be exercised only at the fixed
date T; an American option allows the holder to choose the time of exercise.)
The payoff to the option holder at time T is thus

$$ (S(T) - K)^+ = \max\{0, S(T) - K\}. $$

To get the present value of this payoff we multiply by a discount factor $e^{-rT}$,
with r a continuously compounded interest rate. We denote the expected
present value by $E[e^{-rT}(S(T) - K)^+]$.

For this expectation to be meaningful, we need to specify the distribution 
of the random variable S(T), the terminal stock price. In fact, rather than 
simply specifying the distribution at a fixed time, we introduce a model for the 
dynamics of the stock price. The Black-Scholes model describes the evolution 
of the stock price through the stochastic differential equation (SDE) 


$$ \frac{dS(t)}{S(t)} = r\,dt + \sigma\,dW(t), \tag{1.1} $$


with W a standard Brownian motion. (For a brief review of stochastic
calculus, see Appendix B.) This equation may be interpreted as modeling the
percentage changes dS/S in the stock price as the increments of a Brownian
motion. The parameter $\sigma$ is the volatility of the stock price and the coefficient
on dt in (1.1) is the mean rate of return. In taking the rate of return to be 
the same as the interest rate r, we are implicitly describing the risk-neutral 
dynamics of the stock price, an idea reviewed in Section 1.2.1. 

The solution of the stochastic differential equation (1.1) is 


$$ S(T) = S(0) \exp\left( [r - \tfrac{1}{2}\sigma^2] T + \sigma W(T) \right). \tag{1.2} $$


As S(0) is the current price of the stock, we may assume it is known. The
random variable W(T) is normally distributed with mean 0 and variance T;
this is also the distribution of $\sqrt{T} Z$ if Z is a standard normal random variable
(mean 0, variance 1). We may therefore represent the terminal stock price as


$$ S(T) = S(0) \exp\left( [r - \tfrac{1}{2}\sigma^2] T + \sigma \sqrt{T} Z \right). \tag{1.3} $$


The logarithm of the stock price is thus normally distributed, and the stock 
price itself has a lognormal distribution.
The expectation $E[e^{-rT}(S(T) - K)^+]$ is an integral with respect to the
lognormal density of S(T). This integral can be evaluated in terms of the
standard normal cumulative distribution function $\Phi$ as $\mathrm{BS}(S(0), \sigma, T, r, K)$
with

$$ \mathrm{BS}(S, \sigma, T, r, K) = S\,\Phi\left( \frac{\log(S/K) + (r + \tfrac{1}{2}\sigma^2) T}{\sigma\sqrt{T}} \right) - e^{-rT} K\,\Phi\left( \frac{\log(S/K) + (r - \tfrac{1}{2}\sigma^2) T}{\sigma\sqrt{T}} \right). \tag{1.4} $$


This is the Black-Scholes [50] formula for a call option. 
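
For later comparison with simulation output, the closed-form expression can be coded directly. This is a minimal sketch using only the Python standard library; the function name `black_scholes_call` and the numerical inputs (S = K = 100, volatility 0.2, T = 1, r = 0.05) are illustrative assumptions.

```python
import math
from statistics import NormalDist

def black_scholes_call(s0, sigma, t, r, k):
    """Closed-form call value BS(S, sigma, T, r, K), following (1.4)."""
    phi = NormalDist().cdf            # standard normal cumulative distribution function
    vol = sigma * math.sqrt(t)
    d_plus = (math.log(s0 / k) + (r + 0.5 * sigma ** 2) * t) / vol
    d_minus = (math.log(s0 / k) + (r - 0.5 * sigma ** 2) * t) / vol
    return s0 * phi(d_plus) - math.exp(-r * t) * k * phi(d_minus)

price = black_scholes_call(100.0, 0.2, 1.0, 0.05, 100.0)
print(f"BS value: {price:.4f}")
```

For these inputs the formula gives a value of about 10.45.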

In light of the availability of this formula, there is no need to use Monte 
Carlo to compute $E[e^{-rT}(S(T) - K)^+]$. Moreover, we noted earlier that Monte
Carlo is not a competitive method for computing one-dimensional integrals. 
Nevertheless, we now use this example to illustrate the key steps in Monte 
Carlo. From (1.3) we see that to draw samples of the terminal stock price S(T) 
it suffices to have a mechanism for drawing samples from the standard normal 
distribution. Methods for doing this are discussed in Section 2.3; for now we 
simply assume the ability to produce a sequence $Z_1, Z_2, \ldots$ of independent
standard normal random variables. Given a mechanism for generating the $Z_i$,
we can estimate $E[e^{-rT}(S(T) - K)^+]$ using the following algorithm:


for $i = 1, \ldots, n$
    generate $Z_i$
    set $S_i(T) = S(0) \exp\left( [r - \tfrac{1}{2}\sigma^2] T + \sigma\sqrt{T} Z_i \right)$
    set $C_i = e^{-rT}(S_i(T) - K)^+$

set $\hat{C}_n = (C_1 + \cdots + C_n)/n$
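
The algorithm translates almost line for line into code. The sketch below is one possible rendering, assuming Python's built-in generator as the source of normal draws and the illustrative inputs S(0) = K = 100, volatility 0.2, T = 1, r = 0.05; for these inputs the closed-form value from (1.4) is about 10.45, which gives a benchmark for the simulated estimate.

```python
import math
import random

def simulate_call_payoffs(s0, sigma, t, r, k, n, rng):
    """Return n independent discounted payoffs C_i of a European call."""
    drift = (r - 0.5 * sigma ** 2) * t
    vol = sigma * math.sqrt(t)
    discount = math.exp(-r * t)
    payoffs = []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)                 # generate Z_i
        s_t = s0 * math.exp(drift + vol * z)    # terminal price S_i(T), as in (1.3)
        payoffs.append(discount * max(s_t - k, 0.0))
    return payoffs

rng = random.Random(1)                          # arbitrary fixed seed
payoffs = simulate_call_payoffs(100.0, 0.2, 1.0, 0.05, 100.0, 200_000, rng)
c_hat = sum(payoffs) / len(payoffs)             # the estimator: average of the C_i
print(f"Monte Carlo estimate: {c_hat:.4f}")
```

Keeping the individual payoffs, rather than only their running sum, makes it straightforward to compute a sample standard deviation afterwards.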


For any $n \ge 1$, the estimator $\hat{C}_n$ is unbiased, in the sense that its
expectation is the target quantity:

$$ E[\hat{C}_n] = C = E[e^{-rT}(S(T) - K)^+]. $$

The estimator is strongly consistent, meaning that as $n \to \infty$,

$$ \hat{C}_n \to C \quad \text{with probability 1.} $$
For finite but at least moderately large n, we can supplement the point
estimate $\hat{C}_n$ with a confidence interval. Let

$$ s_C = \sqrt{ \frac{1}{n-1} \sum_{i=1}^{n} (C_i - \hat{C}_n)^2 } \tag{1.5} $$

denote the sample standard deviation of $C_1, \ldots, C_n$ and let $z_\delta$ denote the $1 - \delta$
quantile of the standard normal distribution (i.e., $\Phi(z_\delta) = 1 - \delta$). Then

$$ \hat{C}_n \pm z_{\delta/2} \frac{s_C}{\sqrt{n}} \tag{1.6} $$

is an asymptotically (as $n \to \infty$) valid $1 - \delta$ confidence interval for C. (For
a 95% confidence interval, $\delta = .05$ and $z_{\delta/2} \approx 1.96$.) Alternatively, because
the standard deviation is estimated rather than known, we may prefer to
replace $z_{\delta/2}$ with the corresponding quantile from the t distribution with n - 1
degrees of freedom, which results in a slightly wider interval. In either case,
the probability that the interval covers C approaches $1 - \delta$ as $n \to \infty$. (These
ideas are reviewed in Appendix A.) 
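
A confidence interval of this form takes only a few lines to compute from the simulated payoffs. The sketch below uses the normal quantile (not the t quantile) and regenerates call payoffs under the same illustrative assumptions as before (S(0) = K = 100, volatility 0.2, T = 1, r = 0.05); the helper name `confidence_interval` is hypothetical.

```python
import math
import random
from statistics import NormalDist, mean, stdev

def confidence_interval(samples, delta=0.05):
    """Asymptotic 1 - delta interval: sample mean +/- z_{delta/2} * s_C / sqrt(n)."""
    n = len(samples)
    center = mean(samples)
    s_c = stdev(samples)                            # sample standard deviation, n - 1 divisor
    z = NormalDist().inv_cdf(1.0 - delta / 2.0)     # about 1.96 when delta = 0.05
    half_width = z * s_c / math.sqrt(n)
    return center - half_width, center + half_width

# Illustrative data: discounted call payoffs under the lognormal model.
rng = random.Random(42)                             # arbitrary fixed seed
s0, sigma, t, r, k, n = 100.0, 0.2, 1.0, 0.05, 100.0, 50_000
payoffs = [
    math.exp(-r * t) * max(
        s0 * math.exp((r - 0.5 * sigma ** 2) * t + sigma * math.sqrt(t) * rng.gauss(0.0, 1.0)) - k,
        0.0,
    )
    for _ in range(n)
]
lower, upper = confidence_interval(payoffs)
print(f"95% confidence interval: [{lower:.3f}, {upper:.3f}]")
```

For n this large the t quantile is nearly indistinguishable from the normal quantile, so the choice between them matters mainly for small samples.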

The problem of estimating $E[e^{-rT}(S(T) - K)^+]$ by Monte Carlo is simple
enough to be illustrated in a spreadsheet. Commercial spreadsheet software 
typically includes a method for sampling from the normal distribution and 
the mathematical functions needed to transform normal samples to terminal 
stock prices and then to discounted option payoffs. Figure 1.1 gives a schematic 
illustration. The $Z_i$ are samples from the normal distribution; the comments
in the spreadsheet illustrate the formulas used to transform these to arrive
at the estimate $\hat{C}_n$. The spreadsheet layout in Figure 1.1 makes the method
transparent but has the drawback that it requires storing all n replications in
n rows of cells. It is usually possible to use additional spreadsheet commands
to recalculate cell values n times without storing intermediate values.
Fig. 1.1. A spreadsheet for estimating the expected present value of the payoff of
a call option. 


This simple example illustrates a general feature of Monte Carlo methods 
for valuing derivatives, which is that the simulation is built up in layers: each 
of the transformations

$$ Z_i \longrightarrow S_i(T) \longrightarrow C_i $$

exemplifies a typical layer. The first transformation constructs a path of
underlying assets from random variables with simpler distributions and the second
calculates a discounted payoff from each path. In fact, we often have additional
layers above and below these. At the lowest level, we typically start from
independent random variables $U_i$ uniformly distributed between 0 and 1, so we
need a transformation taking the $U_i$ to $Z_i$. The transformation taking the $C_i$
to the sample mean $\hat{C}_n$ and sample standard deviation $s_C$ may be viewed as
another layer. We include another still higher level in, for example, valuing a 
portfolio of instruments, each of which is valued by Monte Carlo. Randomness 
(or apparent randomness) typically enters only at the lowest layer; the sub- 
sequent transformations producing asset paths, payoffs, and estimators are 
usually deterministic. 


Path-Dependent Example 


The payoff of a standard European call option is determined by the terminal 
stock price S(T) and does not otherwise depend on the evolution of S(t) 
between times 0 and T. In estimating $E[e^{-rT}(S(T) - K)^+]$, we were able to
jump directly from time 0 to time T using (1.3) to sample values of S(T). 
Each simulated “path” of the underlying asset thus consists of just the two 
points S(0) and S(T). 

In valuing more complicated derivative securities using more complicated 
models of the dynamics of the underlying assets, it is often necessary to sim- 
ulate paths over multiple intermediate dates and not just at the initial and 
terminal dates. Two considerations may make this necessary: 


o the payoff of a derivative security may depend explicitly on the values of
  underlying assets at multiple dates;

o we may not know how to sample transitions of the underlying assets exactly
  and thus need to divide a time interval [0,T] into smaller subintervals to
  obtain a more accurate approximation to sampling from the distribution
  at time T.


In many cases, both considerations apply. 

Before turning to a detailed example of the first case, we briefly illustrate 
the second. Consider a generalization of the basic model (1.1) in which the 
dynamics of the underlying asset S(t) are given by 


$$ dS(t) = rS(t)\,dt + \sigma(S(t)) S(t)\,dW(t). \tag{1.7} $$


In other words, we now let the volatility $\sigma$ depend on the current level of S.
Except in very special cases, this equation does not admit an explicit solution
of the type in (1.2) and we do not have an exact mechanism for sampling from
the distribution of S(T). In this setting, we might instead partition [0, T] into
m subintervals of length $\Delta t = T/m$ and over each subinterval $[t, t + \Delta t]$
simulate a transition using a discrete (Euler) approximation to (1.7) of the
form


$$ S(t + \Delta t) = S(t) + rS(t)\Delta t + \sigma(S(t)) S(t) \sqrt{\Delta t}\, Z, $$

with Z a standard normal random variable. This relies on the fact that
$W(t + \Delta t) - W(t)$ has mean 0 and standard deviation $\sqrt{\Delta t}$. For each step, we would
use an independent draw from the normal distribution. Repeating this for
m steps produces a value of S(T) whose distribution approximates the exact
(unknown) distribution of S(T) implied by (1.7). We expect that as m becomes
larger (so that $\Delta t$ becomes smaller) the approximating distribution of S(T)
draws closer to the exact distribution. In this example, intermediate times 
are introduced into the simulation to reduce discretization error, the topic of 
Chapter 6. 
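
The Euler recursion can be sketched as follows. The volatility function `sigma_local` is a purely hypothetical choice made so the example runs; it is not a model proposed in the text. The final check uses the fact that, under the recursion itself, each step multiplies the conditional mean by (1 + r Δt), so the mean of the simulated S(T) is S(0)(1 + r Δt)^m.

```python
import math
import random

def sigma_local(s):
    """Hypothetical level-dependent volatility, chosen only for illustration."""
    return 0.15 + 0.10 / (1.0 + (s / 100.0) ** 2)

def euler_terminal_price(s0, r, sigma_fn, t, m, rng):
    """Approximate one draw of S(T) under (1.7) using m Euler steps."""
    dt = t / m
    sqrt_dt = math.sqrt(dt)
    s = s0
    for _ in range(m):
        z = rng.gauss(0.0, 1.0)       # independent normal draw for each step
        s = s + r * s * dt + sigma_fn(s) * s * sqrt_dt * z
    return s

rng = random.Random(7)                # arbitrary fixed seed
n, m = 20_000, 50
samples = [euler_terminal_price(100.0, 0.05, sigma_local, 1.0, m, rng) for _ in range(n)]
mean_terminal = sum(samples) / n
expected_mean = 100.0 * (1.0 + 0.05 / m) ** m   # mean implied by the Euler recursion
print(f"simulated mean {mean_terminal:.3f} vs Euler mean {expected_mean:.3f}")
```

Note that the Euler step, unlike the exact lognormal transition, can in principle produce a negative price when the step size is large relative to the volatility.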

Even if we assume the dynamics in (1.1) of the Black-Scholes model, it 
may be necessary to simulate paths of the underlying asset if the payoff of a 
derivative security depends on the value of the underlying asset at interme- 
diate dates and not just the terminal value. Asian options are arguably the 
simplest path-dependent options for which Monte Carlo is a competitive
computational tool. These are options with payoffs that depend on the average
level of the underlying asset. This includes, for example, the payoff $(\bar{S} - K)^+$
with

$$ \bar{S} = \frac{1}{m} \sum_{j=1}^{m} S(t_j) \tag{1.8} $$

for some fixed set of dates $0 = t_0 < t_1 < \cdots < t_m = T$, with T the date at
which the payoff is received.

To calculate the expected discounted payoff $E[e^{-rT}(\bar{S} - K)^+]$, we need to
be able to generate samples of the average $\bar{S}$. The simplest way to do this is
to simulate the path $S(t_1), \ldots, S(t_m)$ and then compute the average along the
path. We saw in (1.3) how to simulate S(T) given S(0); simulating $S(t_{j+1})$
from $S(t_j)$ works the same way:

$$ S(t_{j+1}) = S(t_j) \exp\left( [r - \tfrac{1}{2}\sigma^2](t_{j+1} - t_j) + \sigma \sqrt{t_{j+1} - t_j}\, Z_{j+1} \right) \tag{1.9} $$

where $Z_1, \ldots, Z_m$ are independent standard normal random variables. Given
a path of values, it is a simple matter to calculate $\bar{S}$ and then the discounted
payoff $e^{-rT}(\bar{S} - K)^+$.

The following algorithm illustrates the steps in simulating n paths of m 
transitions each. To be explicit, we use $Z_{ij}$ to denote the jth draw from the
normal distribution along the ith path. The $\{Z_{ij}\}$ are mutually independent.


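for $i = 1, \ldots, n$
    for $j = 1, \ldots, m$
        generate $Z_{ij}$
        set $S_i(t_j) = S_i(t_{j-1}) \exp\left( [r - \tfrac{1}{2}\sigma^2](t_j - t_{j-1}) + \sigma\sqrt{t_j - t_{j-1}}\, Z_{ij} \right)$
    set $\bar{S} = (S_i(t_1) + \cdots + S_i(t_m))/m$
    set $C_i = e^{-rT}(\bar{S} - K)^+$
set $\hat{C}_n = (C_1 + \cdots + C_n)/n$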
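
A direct rendering of the two nested loops might look like the sketch below. The monthly monitoring dates and the inputs S(0) = K = 100, volatility 0.2, r = 0.05, T = 1 are illustrative assumptions, as is the function name `asian_call_estimate`; the function also returns a standard error so that the precision of the estimate is visible.

```python
import math
import random

def asian_call_estimate(s0, sigma, r, k, dates, n, rng):
    """Estimate the expected discounted payoff of an arithmetic-average Asian call.

    `dates` lists t_1 < ... < t_m = T, with t_0 = 0 implied.
    Returns the estimate and its standard error.
    """
    discount = math.exp(-r * dates[-1])
    payoffs = []
    for _ in range(n):                          # loop over paths i
        s, t_prev, path_sum = s0, 0.0, 0.0
        for t in dates:                         # loop over dates j
            dt = t - t_prev
            z = rng.gauss(0.0, 1.0)             # generate Z_ij
            s *= math.exp((r - 0.5 * sigma ** 2) * dt + sigma * math.sqrt(dt) * z)
            path_sum += s
            t_prev = t
        s_bar = path_sum / len(dates)           # average level along the path
        payoffs.append(discount * max(s_bar - k, 0.0))
    c_hat = sum(payoffs) / n
    s_c = math.sqrt(sum((c - c_hat) ** 2 for c in payoffs) / (n - 1))
    return c_hat, s_c / math.sqrt(n)

dates = [j / 12 for j in range(1, 13)]          # monthly dates, T = 1
c_hat, std_err = asian_call_estimate(100.0, 0.2, 0.05, 100.0, dates, 20_000, random.Random(3))
print(f"Asian call estimate: {c_hat:.3f} +/- {std_err:.3f}")
```

Setting K = 0 gives a convenient sanity check, because the expected discounted average can then be computed exactly from E[S(t_j)] = S(0) e^{r t_j}.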
Figure 1.2 gives a schematic illustration of a spreadsheet implementation
of this method. The spreadsheet has n rows of standard normal random
variables $Z_{ij}$ with m variables in each row. These are mapped to n paths of the
underlying asset, each path consisting of m steps. From each path, the
spreadsheet calculates a value of the time average $\bar{S}_i$ and a value of the discounted
payoff $C_i$. The $C_i$ are averaged to produce the final estimate $\hat{C}_n$.
Fig. 1.2. A spreadsheet for estimating the expected present value of the payoff of
an Asian call option.


1.1.3 Efficiency of Simulation Estimators 


Much of this book is devoted to ways of improving Monte Carlo estimators. 
To discuss improvements, we first need to explain our criteria for compar- 
ing alternative estimators. Three considerations are particularly important: 
computing time, bias, and variance. 

We begin by considering unbiased estimates. The two cases considered in 
Section 1.1.2 (the standard call and the Asian call) produced unbiased
estimates in the sense that in both cases $E[\hat{C}_n] = C$, with $\hat{C}_n$ the corresponding
estimator and C the quantity being estimated. Also, in both cases the
estimator $\hat{C}_n$ was the mean of n independent and identically distributed samples.
We proceed by continuing to consider estimators of this form because this 
setting is both simple and practically relevant. 

Suppose, then, that 

$$ \hat{C}_n = \frac{1}{n} \sum_{i=1}^{n} C_i, $$

with $C_i$ i.i.d., $E[C_i] = C$ and $\mathrm{Var}[C_i] = \sigma_C^2 < \infty$. The central limit theorem
asserts that as the number of replications n increases, the standardized
estimator $(\hat{C}_n - C)/(\sigma_C/\sqrt{n})$ converges in distribution to the standard normal,
a statement often abbreviated as 


$$ \frac{\hat{C}_n - C}{\sigma_C/\sqrt{n}} \Rightarrow N(0,1) $$

or, equivalently, as

$$ \sqrt{n}\,[\hat{C}_n - C] \Rightarrow N(0, \sigma_C^2). \tag{1.10} $$

Here, $\Rightarrow$ denotes convergence in distribution and $N(a, b^2)$ denotes the normal
distribution with mean a and variance $b^2$. The stated convergence in distribution
means that

$$ \lim_{n \to \infty} P\left( \frac{\hat{C}_n - C}{\sigma_C/\sqrt{n}} \le x \right) = \Phi(x) $$

for all x, with $\Phi$ the cumulative normal distribution. The same limit holds
if $\sigma_C$ is replaced with the sample standard deviation $s_C$ (as in (1.5)); this is
important because $\sigma_C$ is rarely known in practice but $s_C$ is easily calculated
from the simulation output. The fact that we can replace $\sigma_C$ with $s_C$ without
changing the limit in distribution follows from the fact that $s_C/\sigma_C \to 1$ as
$n \to \infty$ and general results on convergence in distribution (cf. Appendix A).

The central limit theorem justifies the confidence interval (1.6): as $n \to \infty$,
the probability that this interval straddles the true value C approaches
$1 - \delta$. Put differently, the central limit theorem tells us something about the
distribution of the error in our simulation estimate:


$$ \hat{C}_n - C \approx N(0, \sigma_C^2/n), $$


meaning that the error on the left has approximately the distribution on the 
right. This makes precise the intuitively obvious notion that, other things 
being equal, in comparing two estimators of the same quantity we should 
prefer the one with lower variance. 

But what if other things are not equal? In particular, suppose we have a 
choice between two unbiased estimators and that the one with smaller vari- 
ance takes longer to compute. How should we balance variance reduction and 
computational effort? An informal answer was suggested by Hammersley and 
Handscomb [169]; Fox and Glynn [128] and Glynn and Whitt [160] develop 
a general framework for analyzing this issue and we now review some of its 
main conclusions. 

Suppose that generating a replication $C_i$ takes a fixed amount of computing
time $\tau$. Our objective is to compare estimators based on relative computational
effort, so the units in which we measure computing time are unimportant.
Let s denote our computational budget, measured in the same units
as $\tau$. Then the number of replications we can complete given the available
budget is $\lfloor s/\tau \rfloor$, the integer part of $s/\tau$, and the resulting estimator is $\hat{C}_{\lfloor s/\tau \rfloor}$.
Directly from (1.10), we get

$$ \sqrt{\lfloor s/\tau \rfloor}\, [\hat{C}_{\lfloor s/\tau \rfloor} - C] \Rightarrow N(0, \sigma_C^2) $$

as the computational budget s increases to infinity. Noting that $\lfloor s/\tau \rfloor / s \to 1/\tau$,
it follows that $\sqrt{s}\,[\hat{C}_{\lfloor s/\tau \rfloor} - C]$ is also asymptotically normal but with an
asymptotic variance of $\sigma_C^2 \tau$; i.e.,

$$ \sqrt{s}\, [\hat{C}_{\lfloor s/\tau \rfloor} - C] \Rightarrow N(0, \sigma_C^2 \tau) \tag{1.11} $$

as $s \to \infty$. This limit normalizes the error in the estimator by the computing
time s rather than by the number of replications. It tells us that, given a
budget s, the error in our estimator will be approximately normally distributed
with variance $\sigma_C^2 \tau / s$.

This property provides a criterion for comparing alternative unbiased esti- 
mators. Suppose, for example, that we have two unbiased estimators both of 
which are averages of independent replications, as above. Suppose the variance 
per replication $\sigma_1^2$ of the first estimator is larger than the variance per
replication $\sigma_2^2$ of the second estimator, but the computing times per replication $\tau_i$,
i = 1,2, of the two estimators satisfy $\tau_1 < \tau_2$. How should we choose between
the faster, more variable estimator and the slower, less variable estimator?
The formulation of the central limit theorem in (1.11) suggests that asymptotically
(as the computational budget grows), we should prefer the estimator
with the smaller value of $\sigma_i^2 \tau_i$, because this is the one that will produce the
more precise estimate (and narrower confidence interval) from the budget s. 

A feature of the product $\sigma^2 \tau$ (variance per replication times computer time
per replication) as a measure of efficiency is that it is insensitive to bundling 
multiple replications into a single replication. Suppose, for example, that we 
simply redefine a replication to be the average of two independent copies of 
the original replications. This cuts the variance per replication in half but 
doubles the computing time per replication and thus leaves the product of 
the two unaltered. A purely semantic change in what we call a replication 
does not affect our measure of efficiency. 
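
The comparison rule and its invariance under bundling amount to simple arithmetic, as the following sketch shows. The variance and timing figures are hypothetical numbers chosen for illustration, and the helper names are not from the text.

```python
def work_normalized_variance(var_per_rep, time_per_rep):
    """Product sigma^2 * tau: the asymptotic variance in the budget-normalized limit (1.11)."""
    return var_per_rep * time_per_rep

def bundle(var_per_rep, time_per_rep, k):
    """Redefine one replication as the average of k independent original replications."""
    return var_per_rep / k, time_per_rep * k

# Hypothetical figures: (variance per replication, computing time per replication).
fast = (4.0, 1.0)     # faster, more variable estimator
slow = (1.5, 2.0)     # slower, less variable estimator

fast_score = work_normalized_variance(*fast)
slow_score = work_normalized_variance(*slow)
preferred = "slow" if slow_score < fast_score else "fast"
bundled_score = work_normalized_variance(*bundle(*fast, 2))
print(f"fast: {fast_score}, slow: {slow_score}, preferred: {preferred}")
print(f"fast estimator after bundling pairs: {bundled_score}")
```

Here the slower estimator is preferred because its variance reduction more than compensates for its doubled computing time, and bundling replications in pairs leaves the score of the fast estimator unchanged.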

The argument leading to the work-normalized central limit theorem (1.11) 
requires that the computing time per replication be constant. This would be 
almost exactly the case in, for example, the simulation of the Asian option con- 
sidered in Section 1.1.2: all replications require simulating the same number 
of transitions, and the time per transition is nearly constant. This feature is
characteristic of many derivative pricing problems in which the time per repli- 
cation is determined primarily by the number of time steps simulated. But 
there are also cases in which computing time can vary substantially across 
replications. In pricing a barrier option, for example (cf. Section 3.2.2), one 
might terminate a path the first time a barrier is crossed; the number of tran- 
sitions until this happens is typically random. Sampling through acceptance- 
rejection (as discussed in Section 2.2.2) also introduces randomness in the 
time per replication. 

To generalize (1.11) to these cases, we replace the assumption of a fixed 
computing time with the condition that $(C_1, \tau_1), (C_2, \tau_2), \ldots$ are independent
and identically distributed, with $C_i$ as before and $\tau_i$ now denoting the computer
time required for the $i$th replication. The number of replications that
can be completed with a computing budget $s$ is

$$N(s) = \sup\Big\{ n \ge 0 : \sum_{i=1}^{n} \tau_i \le s \Big\}$$

and is also random. Our estimator based on a budget $s$ is $\hat{C}_{N(s)}$, the average
of the first $N(s)$ replications. Our assumption of i.i.d. replications ensures
that $N(s)/s \to 1/E[\tau]$ with probability one (this is the elementary renewal
theorem) and then that (1.11) generalizes to (cf. Appendix A.1)

$$\sqrt{s}\,[\hat{C}_{N(s)} - C] \Rightarrow N(0, \sigma_C^2 E[\tau]). \qquad (1.12)$$


This limit provides a measure of asymptotic relative efficiency when the com- 
puting time per replication is variable. It indicates that in comparing al- 
ternative estimators, each of which is the average of unbiased independent 
replications, we should prefer the one for which the product


(variance per replication) × (expected computing time per replication)


is smallest. This principle (an early version of which may be found in Hammer- 
sley and Handscomb [169], p.51) is a special case of a more general formulation 
developed by Glynn and Whitt [160] for comparing the efficiency of simulation 
estimators. Their results include a limit of the form in (1.12) that holds in far 
greater generality than the case of i.i.d. replications we consider here. 
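
The role of the random replication count $N(s)$ can be illustrated with a short Python sketch. It counts how many replications fit within a budget and compares $N(s)/s$ with $1/E[\tau]$. The exponential distribution for the replication times, the seed, and the helper names are assumptions made only for this illustration.

```python
import random

def replications_within_budget(times, budget):
    """N(s): the largest n such that the first n replication times sum to at most s."""
    total = 0.0
    n = 0
    for t in times:
        if total + t > budget:
            break
        total += t
        n += 1
    return n

def random_times(rng, mean_time):
    """Endless stream of i.i.d. exponential replication times (an assumed model)."""
    while True:
        yield rng.expovariate(1.0 / mean_time)

rng = random.Random(12345)
budget = 10_000.0
mean_time = 2.0
n_completed = replications_within_budget(random_times(rng, mean_time), budget)
print(n_completed / budget, 1.0 / mean_time)  # the two numbers should be close
```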


Bias 


The efficiency comparisons above, based on the central limit theorems in (1.10) 
and (1.12), rely on the fact that the estimators to be compared are aver- 
ages of unbiased replications. In the absence of bias, estimator variability and 
computational effort are the most important considerations. However, reduc- 
ing variability or computing time would be pointless if it merely accelerated 
convergence to an incorrect value. While accepting bias in small samples is 
sometimes necessary, we are interested only in estimators for which any bias 
can be eliminated through increasing computational effort. 

Some simulation estimators are biased for all finite sample sizes but become
asymptotically unbiased as the number of replications increases. This
is true of $\hat{C}_{N(s)}$, for example. When the $\tau_i$ are random, $E[\hat{C}_{N(s)}] \ne C$, but
the central limit theorem (1.12) shows that the bias in this case becomes
negligible as $s$ increases. Glynn and Heidelberger [155] show that it can be
entirely eliminated by forcing completion of at least the first replication, because
$E[\hat{C}_{\max\{1,N(s)\}}] = C$.

Another example is provided by the problem of estimating a ratio of
expectations $E[X]/E[Y]$ from i.i.d. replications $(X_i, Y_i)$, $i = 1, \ldots, n$, of the pair
$(X, Y)$. The ratio of sample means $\bar{X}/\bar{Y}$ is biased for all $n$ because

$$E\left[\frac{\bar{X}}{\bar{Y}}\right] \ne \frac{E[\bar{X}]}{E[\bar{Y}]} = \frac{E[X]}{E[Y]};$$

but $\bar{X}/\bar{Y}$ clearly converges to $E[X]/E[Y]$ with probability 1 as $n \to \infty$. Moreover,
the normalized error

$$\sqrt{n}\left( \frac{\bar{X}}{\bar{Y}} - \frac{E[X]}{E[Y]} \right)$$


is asymptotically normal, a point we return to in Section 4.3.3. Thus, the bias 
becomes negligible as the number of replications increases, and the conver- 
gence rate of the estimator is unaffected. 
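
A quick experiment shows the ratio bias and its disappearance as $n$ grows. The sketch below uses a toy model that is purely an assumption for illustration: $X$ uniform on $(0, 2)$ and $Y$ uniform on $(1, 3)$, independent, so that $E[X]/E[Y] = 1/2$. The function names and the seed are likewise hypothetical.

```python
import random

def ratio_of_sample_means(pairs):
    """Return X-bar / Y-bar for a list of (x, y) pairs."""
    n = len(pairs)
    x_bar = sum(x for x, _ in pairs) / n
    y_bar = sum(y for _, y in pairs) / n
    return x_bar / y_bar

def draw_pairs(rng, n):
    """Assumed toy model: X ~ Uniform(0, 2), Y ~ Uniform(1, 3), independent."""
    return [(rng.uniform(0.0, 2.0), rng.uniform(1.0, 3.0)) for _ in range(n)]

def mean_ratio_estimate(rng, n, repeats):
    """Average the ratio estimator over many independent experiments of size n."""
    return sum(ratio_of_sample_means(draw_pairs(rng, n)) for _ in range(repeats)) / repeats

rng = random.Random(2024)
true_ratio = 1.0 / 2.0                       # E[X] / E[Y] in the toy model
small_n = mean_ratio_estimate(rng, n=2, repeats=40_000)
large_n = mean_ratio_estimate(rng, n=200, repeats=2_000)
print(true_ratio, small_n, large_n)
```

In this toy model the average of the estimator with $n = 2$ should sit visibly above $1/2$, while the average with $n = 200$ should be very close to it.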

But not all types of bias vanish automatically in large samples — some 
require special effort. Three examples should help illustrate typical sources 
of non-negligible bias in financial engineering simulations. In each of these 
examples the bias persists as the number of replications increases, but the 
bias is nevertheless manageable in the sense that it can be made as small as 
necessary through additional computational effort. 


Example 1.1.1 Model discretization error. In Section 1.1.2 we illustrated the 
use of Monte Carlo in estimating the expected present value of the payoff of a 
standard call option and an Asian call option under Black-Scholes assumptions 
on the dynamics of the underlying stock. We obtained unbiased estimates by 
simulating the underlying stock using (1.3) and (1.9). Suppose that instead of 
using (1.9) we divide the time horizon into small increments of length h and 
approximate changes in the underlying stock using the recursion 


$$S((j+1)h) = S(jh) + r S(jh) h + \sigma S(jh) \sqrt{h}\, Z_{j+1},$$


with $Z_1, Z_2, \ldots$ independent standard normal random variables. The joint
distribution of the values of the stock price along a path simulated using 
this rule will not be exactly the same as that implied by the Black-Scholes 
dynamics in (1.1). As a consequence, the expected present value of an option 
payoff estimated using this simulation rule will differ from the exact value — 
the simulation estimator is biased. This is an example of discretization bias 
because it results from time-discretization of the continuous-time dynamics of 
the underlying model. 

Of course, in this example, the bias can be eliminated by using the exact 
method (1.9) to simulate values of the underlying stock at the relevant dates. 
But for many models, exact sampling of the continuous-time dynamics is 
infeasible and discretization error is inevitable. This is typically the case if, 
for example, the volatility parameter $\sigma$ is a function of the stock price $S$,
as in (1.7). The resulting bias can be managed because it typically vanishes
as the time step $h$ decreases. However, taking $h$ smaller entails generating
more transitions per path (assuming a fixed time horizon) and thus a higher
computational burden. □
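
The difference between the two simulation rules in Example 1.1.1 can be seen by driving both with the same normal draws. In the sketch below, `euler_terminal` applies the recursion displayed in the example, while `exact_terminal` applies the standard lognormal update for geometric Brownian motion, which we assume is what (1.9) denotes since that equation is not reproduced in this section. Parameter values and the list of draws are hypothetical.

```python
import math

def euler_terminal(s0, r, sigma, h, normals):
    """Apply S <- S + r*S*h + sigma*S*sqrt(h)*Z once per normal draw."""
    s = s0
    for z in normals:
        s = s + r * s * h + sigma * s * math.sqrt(h) * z
    return s

def exact_terminal(s0, r, sigma, h, normals):
    """Apply the exact lognormal update for geometric Brownian motion per step of length h."""
    s = s0
    for z in normals:
        s = s * math.exp((r - 0.5 * sigma ** 2) * h + sigma * math.sqrt(h) * z)
    return s

# The same (hypothetical) normal draws drive both schemes.
draws = [0.3, -1.2, 0.8, 0.1]
print(euler_terminal(100.0, 0.05, 0.2, 0.25, draws))
print(exact_terminal(100.0, 0.05, 0.2, 0.25, draws))

# A large negative shock makes the Euler value negative; the exact update stays positive.
print(euler_terminal(100.0, 0.05, 0.2, 1.0, [-6.0]))
print(exact_terminal(100.0, 0.05, 0.2, 1.0, [-6.0]))
```

The last two lines illustrate one way the distributions differ: the Euler recursion can produce negative prices, whereas the lognormal update cannot.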
Example 1.1.2 Payoff discretization error. Suppose that in the definition of
the Asian option in Section 1.1.2, we replace the discrete average in (1.8) with
a continuous average

$$\bar{S} = \frac{1}{T} \int_0^T S(u)\, du.$$

In this case, even if we use (1.9) to generate values of $S(t_i)$ at a discrete
set of dates $t_i$, we cannot calculate $\bar{S}$ exactly; we need to use a discrete
approximation to the continuous average. A similar issue arises in estimating, e.g.,

$$E\Big[e^{-rT}\Big(\max_{0 \le t \le T} S(t) - S(T)\Big)\Big],$$

the expected present value of the payoff of a lookback option. Even if we
simulate a path $S(0), S(t_1), \ldots, S(t_m)$ exactly (i.e., using (1.9)), the estimator

$$e^{-rT}\Big(\max_{0 \le j \le m} S(t_j) - S(T)\Big)$$

is biased; in particular, the maximum over the $S(t_j)$ can never exceed and will
almost surely underestimate the maximum of S(t) over all t between 0 and T. 
In both cases, the bias can be made arbitrarily small by using a sufficiently 
small simulation time step, at the expense of increasing the computational 
cost per path. Notice that this example differs from Example 1.1.1 in that 
the source of discretization error is the form of the option payoff rather than 
the underlying model; the $S(t_i)$ themselves are sampled without discretization
error. This type of bias is less common in practice than the model discretization
error in Example 1.1.1 because option contracts are often sensitive to the
value of the underlying asset at only a finite set of dates. □
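
The downward bias of the discretely monitored maximum in Example 1.1.2 can be checked on a single simulated path: monitoring a subset of the dates can only lower the observed maximum. The sketch below samples one path on a fine grid with the standard lognormal update (assumed to correspond to (1.9)) and evaluates the lookback payoff on nested coarser grids. All parameter values, the seed, and the function names are hypothetical.

```python
import math
import random

def exact_gbm_path(s0, r, sigma, horizon, steps, rng):
    """Sample S at steps+1 equally spaced dates using the exact lognormal update."""
    h = horizon / steps
    path = [s0]
    for _ in range(steps):
        z = rng.gauss(0.0, 1.0)
        path.append(path[-1] * math.exp((r - 0.5 * sigma ** 2) * h + sigma * math.sqrt(h) * z))
    return path

def lookback_payoff(path, stride, r, horizon):
    """Discounted (maximum over every stride-th date) minus the terminal value."""
    monitored = path[::stride]
    return math.exp(-r * horizon) * (max(monitored) - path[-1])

rng = random.Random(7)
path = exact_gbm_path(100.0, 0.05, 0.2, 1.0, 64, rng)
for stride in (64, 16, 4, 1):          # coarser to finer monitoring of the same path
    print(64 // stride, lookback_payoff(path, stride, 0.05, 1.0))
```

Because each coarser grid is a subset of the finer one, the printed payoffs are nondecreasing as the number of monitoring dates grows.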


Example 1.1.3 Nonlinear functions of means. Consider an option expiring
at $T_1$ to buy a call option expiring at $T_2 > T_1$; this is an option on an
option, sometimes called a compound option. Let $C^{(2)}(x)$ denote the expected
discounted payoff of the option expiring at $T_2$ conditional on the underlying
stock price equaling $x$ at time $T_1$. More explicitly,

$$C^{(2)}(x) = E[e^{-r(T_2 - T_1)} (S(T_2) - K_2)^+ \mid S(T_1) = x]$$

with $K_2$ the strike price. If the compound option has a strike of $K_1$, then the
expected present value of its payoff is

$$C^{(1)} = E[e^{-rT_1} (C^{(2)}(S(T_1)) - K_1)^+].$$

If the dynamics of the underlying stock are described by the Black-Scholes
model (1.1), $C^{(2)}$ and $C^{(1)}$ can be evaluated explicitly. But consider the
problem of estimating $C^{(1)}$ by simulation. To do this, we simulate $n$ values
$S_1(T_1), \ldots, S_n(T_1)$ of the stock at $T_1$ and then $k$ values $S_{i1}(T_2), \ldots, S_{ik}(T_2)$
of the stock at $T_2$ from each $S_i(T_1)$, as illustrated in Figure 1.3. We estimate
the inner option value at $S_i(T_1)$ using

$$\hat{C}^{(2)}(S_i(T_1)) = \frac{1}{k} \sum_{j=1}^{k} e^{-r(T_2 - T_1)} (S_{ij}(T_2) - K_2)^+$$

and then estimate $C^{(1)}$ using

$$\hat{C}^{(1)} = \frac{1}{n} \sum_{i=1}^{n} e^{-rT_1} (\hat{C}^{(2)}(S_i(T_1)) - K_1)^+.$$


Fig. 1.3. Nested simulation used to estimate a function of a conditional expectation.


If we replaced the inner estimate $\hat{C}^{(2)}$ with its expectation, the result
would be an unbiased estimator of $C^{(1)}$. But because we estimate the inner
expectation, the overall estimator is biased high:

$$\begin{aligned}
E[\hat{C}^{(1)}] &= E[e^{-rT_1}(\hat{C}^{(2)}(S_i(T_1)) - K_1)^+] \\
&= E\big[E[e^{-rT_1}(\hat{C}^{(2)}(S_i(T_1)) - K_1)^+ \mid S_i(T_1)]\big] \\
&\ge E\big[e^{-rT_1}(E[\hat{C}^{(2)}(S_i(T_1)) \mid S_i(T_1)] - K_1)^+\big] \\
&= E[e^{-rT_1}(C^{(2)}(S_i(T_1)) - K_1)^+] \\
&= C^{(1)}.
\end{aligned}$$

This follows from Jensen's inequality and the convexity of the function
$y \mapsto (y - K_1)^+$. As the number $k$ of samples of $S(T_2)$ generated per sample of
$S(T_1)$ increases, the bias vanishes because $\hat{C}^{(2)}(S_i(T_1)) \to C^{(2)}(S_i(T_1))$ with
probability one. The bias can therefore be managed, but once again only at 
the expense of increasing the computational cost per replication. 

The source of bias in this example is the application of a nonlinear function 
(in this case, the option payoff) to an estimate of an expectation. Closely 
related biases arise in at least two important applications of Monte Carlo 
in financial engineering. In measuring portfolio risk over a fixed horizon, the 
value of the portfolio at the end of the horizon is a conditional expectation. 
In valuing American options by simulation, the option payoff at each exercise 
date must be compared with the conditionally expected discounted payoff 
from waiting to exercise. These topics are discussed in Chapters 8 and 9. □
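
The high bias in Example 1.1.3 is easy to observe numerically. The following sketch implements the nested estimator under Black-Scholes dynamics, using the standard lognormal update for each step, and compares a crude inner estimate ($k = 1$) with a more refined one ($k = 50$). The parameter values ($S(0) = 100$, $r = 0.05$, $\sigma = 0.2$, $T_1 = 0.5$, $T_2 = 1$, $K_1 = 8$, $K_2 = 100$), the seed, and the function names are assumptions chosen only for illustration.

```python
import math
import random

def gbm_step(s, r, sigma, dt, rng):
    """One exact lognormal step of geometric Brownian motion over a time dt."""
    z = rng.gauss(0.0, 1.0)
    return s * math.exp((r - 0.5 * sigma ** 2) * dt + sigma * math.sqrt(dt) * z)

def nested_compound_estimate(n, k, rng, s0=100.0, r=0.05, sigma=0.2,
                             t1=0.5, t2=1.0, k1=8.0, k2=100.0):
    """Nested estimator: n outer samples of S(T1), k inner samples of S(T2) from each."""
    total = 0.0
    for _ in range(n):
        s_t1 = gbm_step(s0, r, sigma, t1, rng)
        inner = 0.0
        for _ in range(k):
            s_t2 = gbm_step(s_t1, r, sigma, t2 - t1, rng)
            inner += math.exp(-r * (t2 - t1)) * max(s_t2 - k2, 0.0)
        inner_value = inner / k                      # estimate of the inner option value
        total += math.exp(-r * t1) * max(inner_value - k1, 0.0)
    return total / n

rng = random.Random(42)
crude = nested_compound_estimate(n=10_000, k=1, rng=rng)
refined = nested_compound_estimate(n=4_000, k=50, rng=rng)
print(crude, refined)   # the k = 1 estimate is expected to be noticeably larger
```

With $k = 1$ the inner estimate is very noisy, so applying the convex payoff to it inflates the result; increasing $k$ pulls the estimate down toward the compound option value.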
Examples 1.1.1-1.1.3 share some important features. In each case, the
relevant estimator is an average of independent replications; each replication 
is biased but the bias can be made arbitrarily small at the expense of increasing 
the computational cost per replication. Given a fixed computing budget, we 
therefore face a tradeoff in allocating the budget. Expending more effort per 
replication lowers bias, but it also decreases the number of replications that 
can be completed and thus tends to increase estimator variance. 

We need a measure of estimator performance that balances bias and vari- 
ance. A standard measure is mean square error, which equals the sum of bias 
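squared and variance. More explicitly, if $\hat{\alpha}$ is an estimator of a quantity $\alpha$,
then

$$\begin{aligned}
\mathrm{MSE}(\hat{\alpha}) &= E[(\hat{\alpha} - \alpha)^2] \\
&= (E[\hat{\alpha}] - \alpha)^2 + E[(\hat{\alpha} - E[\hat{\alpha}])^2] \\
&= \mathrm{Bias}^2(\hat{\alpha}) + \mathrm{Variance}(\hat{\alpha}).
\end{aligned}$$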


While exact calculation of mean square error is generally impractical, it is 
often possible to compare estimators through their asymptotic MSE. 
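
The decomposition of mean square error into squared bias and variance holds exactly for any distribution of the estimator, including an empirical one. The sketch below checks it on a small list of hypothetical estimator outcomes treated as equally likely; the function name and the numbers are illustrative assumptions.

```python
import statistics

def mse_decomposition(estimates, target):
    """Return (mse, squared_bias, variance) for equally likely estimator outcomes."""
    mean_estimate = statistics.fmean(estimates)
    mse = statistics.fmean((x - target) ** 2 for x in estimates)
    squared_bias = (mean_estimate - target) ** 2
    variance = statistics.pvariance(estimates)    # population variance, divisor len(estimates)
    return mse, squared_bias, variance

# Hypothetical outcomes of an estimator whose target value is 2.0.
mse, squared_bias, variance = mse_decomposition([1.0, 2.0, 3.0, 6.0], target=2.0)
print(mse, squared_bias, variance)
```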

For simplicity, we restrict attention to estimators that are sample means 
of i.i.d. replications. Extending the notation used in the unbiased case, we 
write $\hat{C}(n, \delta)$ for the average of $n$ independent replications with parameter $\delta$.
This parameter determines the bias: we assume $E[\hat{C}(n, \delta)] = \alpha_\delta$ and $\alpha_\delta \to \alpha$
as $\delta \to 0$, with $\alpha$ the quantity to be estimated. In Examples 1.1.1 and 1.1.2,
$\delta$ could be the simulation time increment along each path; in Example 1.1.3
we could take $\delta = 1/k$. We investigate the mean square error of $\hat{C}(n, \delta)$ as the
computational budget grows. 

Under reasonable additional conditions (in particular, uniform integrability),
the central limit theorem in (1.12) for the asymptotically unbiased
estimator $\hat{C}_{N(s)}$ implies

$$s\,\mathrm{Var}[\hat{C}_{N(s)}] \to \sigma_C^2 E[\tau];$$

equivalently,

$$s^{1/2} \sqrt{\mathrm{Var}[\hat{C}_{N(s)}]} \to \sigma_C \sqrt{E[\tau]}. \qquad (1.13)$$

The power of $s$ on the left tells us the rate at which the standard error of
$\hat{C}_{N(s)}$ (the square root of its variance) decreases, and the limit on the right
tells us the constant associated with this asymptotic rate. We proceed to 
derive similar information in the biased case, where the asymptotic rate of 
decrease of the mean square error depends, in part, on how computational 
effort is allocated to reducing bias and variance. 

For this analysis, we need to make some assumptions about the estimator. 
Let $\tau_\delta$ be the computer time per replication at parameter $\delta$, which we assume
to be nonrandom. For the estimator bias and computing time, we assume
there are constants $\eta, \beta > 0$, $b$, and $c > 0$ such that, as $\delta \to 0$,

$$\alpha_\delta - \alpha = b\delta^{\beta} + o(\delta^{\beta}) \qquad (1.14)$$

$$\tau_\delta = c\delta^{-\eta} + o(\delta^{-\eta}). \qquad (1.15)$$


For Examples 1.1.1-1.1.3, it is reasonable to expect that (1.15) holds with 
$\eta = 1$ because in all three examples the work per path is roughly linear in
$1/\delta$. The value of $\beta$ can vary more from one problem to another, but typical
values are 1/2, 1, and 2. We will see in Chapter 6 that the value of $\beta$ often
depends on how one chooses to approximate a continuous-time process. 
Given a computational budget $s$, we can specify an allocation of this budget
to reducing bias and variance by specifying a rule $s \mapsto \delta(s)$ for selecting the
parameter $\delta$. The resulting number of replications is $N(s) = \lfloor s/\tau_{\delta(s)} \rfloor$ and the
resulting estimator is $\hat{C}(s) = \hat{C}(N(s), \delta(s))$; notice that the estimator is now
indexed by the single parameter $s$ whereas it was originally indexed by both
the number of replications $n$ and the bias parameter $\delta$. We consider allocation
rules $\delta(s)$ for which

$$\delta(s) = a s^{-\gamma} + o(s^{-\gamma}) \qquad (1.16)$$

for some constants $a, \gamma > 0$. A larger $\gamma$ corresponds to a smaller $\delta(s)$ and
thus greater effort allocated to reducing bias; through (1.15), smaller $\delta$ also
implies greater computing time per replication, hence fewer replications and
less effort allocated to reducing variance. Our goal is to relate the choice of $\gamma$
to the rate at which the MSE of $\hat{C}(s)$ decreases as $s$ increases.

For large $s$, we have $N(s) \approx s/\tau_{\delta(s)}$; (1.15) and (1.16) together imply that
$\tau_{\delta(s)}$ is $O(s^{\gamma\eta})$ and hence that $N(s)$ is $O(s^{1-\gamma\eta})$. A minimal requirement on
the allocation rule $\delta(s)$ is that the number of replications $N(s)$ increase with
$s$. We therefore restrict $\gamma$ to be less than $1/\eta$ so that $1 - \gamma\eta > 0$.
As a step in our analysis of the MSE, we write the squared bias as 


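$$\begin{aligned}
(\alpha_{\delta(s)} - \alpha)^2 &= b^2 \delta(s)^{2\beta} + o(\delta(s)^{2\beta}) \\
&= b^2 a^{2\beta} s^{-2\beta\gamma} + o(s^{-2\beta\gamma}) \qquad (1.17) \\
&= O(s^{-2\beta\gamma}) \qquad (1.18)
\end{aligned}$$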


using (1.14) and (1.16). 
Next we consider variance. Let $\sigma_\delta^2$ denote the variance per replication at
parameter $\delta$. Then

$$\mathrm{Var}[\hat{C}(s)] = \frac{\sigma_{\delta(s)}^2}{\lfloor s/\tau_{\delta(s)} \rfloor}.$$

We assume that $\sigma_\delta^2$ approaches a finite limit $\sigma^2 > 0$ as $\delta \to 0$. This is a natural
assumption in the examples of this section: in Examples 1.1.1 and 1.1.2, $\sigma^2$ is
the variance in the continuous-time limit; in Example 1.1.3, it is the variance
that remains from the first simulation step after the variance in the second
step is eliminated by letting $k \to \infty$. Under this assumption we have

$$\mathrm{Var}[\hat{C}(s)] = \frac{\sigma^2 \tau_{\delta(s)}}{s} + o(\tau_{\delta(s)}/s).$$
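Combining this expression for the variance with (1.15) and (1.16), we get

$$\begin{aligned}
\mathrm{Var}[\hat{C}(s)] &= \frac{\sigma^2 c\, \delta(s)^{-\eta}}{s} + o(\delta(s)^{-\eta}/s) \\
&= \sigma^2 c a^{-\eta} s^{\gamma\eta - 1} + o(s^{\gamma\eta - 1}) \qquad (1.19) \\
&= O(s^{\gamma\eta - 1}). \qquad (1.20)
\end{aligned}$$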


The order of magnitude of the MSE is the sum of (1.18) and (1.20). 
Consider the effect of different choices of $\gamma$. If $2\beta\gamma > 1 - \gamma\eta$ then the
allocation rule drives the squared bias (1.18) to zero faster than the variance
(1.20), so the MSE is eventually dominated by the variance. Conversely, if
$2\beta\gamma < 1 - \gamma\eta$ then for large $s$ the MSE is dominated by the squared bias. An
optimal allocation rule selects $\gamma$ to balance the two terms. Setting $2\beta\gamma = 1 - \gamma\eta$
means taking $\gamma = 1/(2\beta + \eta)$. Substituting this back into (1.17) and (1.19)
results in

$$\mathrm{MSE}(\hat{C}(s)) = (b^2 a^{2\beta} + \sigma^2 c a^{-\eta}) s^{-2\beta/(2\beta+\eta)} + o(s^{-2\beta/(2\beta+\eta)}) \qquad (1.21)$$

and thus for the root mean square error we have

$$\mathrm{RMSE}(\hat{C}(s)) = \sqrt{\mathrm{MSE}(\hat{C}(s))} = O(s^{-\beta/(2\beta+\eta)}). \qquad (1.22)$$


The exponent of s in this approximation gives the convergence rate of the 
RMSE and should be contrasted with the convergence rate of $s^{-1/2}$ in (1.13).
By minimizing the coefficient in (1.21) we can also find the optimal parameter
$a$ in the allocation rule (1.16),

$$a^* = \left( \frac{\eta \sigma^2 c}{2\beta b^2} \right)^{1/(2\beta+\eta)},$$

but this is of less immediate practical value than the convergence rate in 
(1.22). 

A large $\beta$ corresponds to a rapidly vanishing bias; as $\beta \to \infty$ we have
$\beta/(2\beta + \eta) \to 1/2$, recovering the convergence rate of the standard error in
the unbiased case. Similarly, when $\eta$ is small it follows from (1.15) that the
computational cost of reducing bias is small; in the limit as $\eta \to 0$ we again
get $\beta/(2\beta + \eta) \to 1/2$. But for any finite $\beta$ and positive $\eta$, (1.22) shows that
we must expect a slower convergence rate using an estimator that is unbiased 
only asymptotically compared with one that is unbiased. 

Under an allocation rule satisfying (1.16), taking $\gamma = 1/(2\beta + \eta)$ implies
that the bias parameter $\delta$ should decrease rather slowly as the computational
budget increases. Consider, for instance, bias resulting from model discretization
error as in Example 1.1.1. In this setting, interpreting $\delta$ as the simulation
time increment, the values $\beta = \eta = 1$ would often apply, resulting in $\gamma = 1/3$.
Through (1.16), this implies that the time increment should be cut in half 
with an eight-fold increase in the computational budget. 
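
The arithmetic of the balanced allocation can be collected in a few lines of Python. The sketch below computes the exponent $\gamma = 1/(2\beta + \eta)$, the resulting RMSE rate, the budget increase needed to shrink $\delta$ by a given factor, and the coefficient-minimizing $a$ from (1.21). Function names and the constants passed to them are hypothetical.

```python
def optimal_gamma(beta, eta):
    """Exponent gamma = 1 / (2*beta + eta) that balances squared bias and variance."""
    return 1.0 / (2.0 * beta + eta)

def rmse_rate(beta, eta):
    """Exponent p such that the RMSE is O(s**(-p)) under the balanced allocation."""
    return beta / (2.0 * beta + eta)

def budget_multiplier(delta_factor, gamma):
    """Factor by which s must grow so that delta(s) = a * s**(-gamma) shrinks by delta_factor."""
    return delta_factor ** (-1.0 / gamma)

def mse_coefficient(a, b, c, sigma, beta, eta):
    """Leading coefficient b^2 a^(2 beta) + sigma^2 c a^(-eta) in (1.21)."""
    return b ** 2 * a ** (2.0 * beta) + sigma ** 2 * c * a ** (-eta)

def optimal_a(b, c, sigma, beta, eta):
    """Minimizer of mse_coefficient over a > 0."""
    return (eta * sigma ** 2 * c / (2.0 * beta * b ** 2)) ** (1.0 / (2.0 * beta + eta))

gamma = optimal_gamma(beta=1.0, eta=1.0)
print(gamma, rmse_rate(1.0, 1.0), budget_multiplier(0.5, gamma))
```

For $\beta = \eta = 1$ this reproduces $\gamma = 1/3$ and the eight-fold budget increase needed to halve the time increment, and it gives an RMSE rate of $s^{-1/3}$ rather than $s^{-1/2}$.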
In applications of Monte Carlo to financial engineering, estimator variance
is typically larger than (squared) bias. With a few notable exceptions
(including the pricing of American options), it is generally easier to imple- 
ment a simulation with a comfortably small bias than with a comfortably 
small standard error. (For example, it is often difficult to measure the reduc- 
tion in discretization bias achieved using the methods of Chapter 6 because 
the bias is overwhelmed by simulation variability.) This is consistent with the 
rather slow decrease in $\delta(s)$ recommended by the analysis above, but it may
also in part reflect the relative magnitudes of the constants $b$, $c$, and $\sigma$. These
constants may be difficult to determine; the order of magnitude in (1.21) can
nevertheless provide useful insight, especially when very precise simulation
results are required, for which the limit $s \to \infty$ is particularly relevant.

The argument above leading to (1.21) considers only the convergence of 
the mean square error. Glynn and Whitt [160] analyze asymptotic efficiency 
through the convergence rate of the limit in distribution of simulation estima- 
tors. Under uniform integrability conditions, a convergence rate in distribution 
implies a convergence rate for the MSE, but the limiting distribution also pro- 
vides additional information, just as the central limit theorem (1.12) provides 
information beyond (1.13). 


1.2 Principles of Derivatives Pricing 


The mathematical theory of derivatives pricing is both elegant and remarkably 
practical. A proper development of the theory and of the tools needed even to 
state precisely its main results requires a book-length treatment; we therefore 
assume familiarity with at least the basic ideas of mathematical finance and 
refer the reader to Björk [48], Duffie [98], Hunt and Kennedy [191], Lamberton 
and Lapeyre [218], and Musiela and Rutkowski [275] for further background. 
We will, however, highlight some principles of the theory, especially those that 
bear on the applicability of Monte Carlo to the calculation of prices. Three
ideas are particularly important: 


1. If a derivative security can be perfectly replicated (equivalently, hedged) 
through trading in other assets, then the price of the derivative security 
is the cost of the replicating trading strategy. 

2. Discounted (or deflated) asset prices are martingales under a probabil- 
ity measure associated with the choice of discount factor (or numeraire). 
Prices are expectations of discounted payoffs under such a martingale 
measure. 

3. In a complete market, any payoff (satisfying modest regularity conditions) 
can be synthesized through a trading strategy, and the martingale measure 
associated with a numeraire is unique. In an incomplete market there are 
derivative securities that cannot be perfectly hedged; the price of such a 
derivative is not completely determined by the prices of other assets. 
The rest of this chapter is devoted to explaining these principles and to
developing enough of the underlying theory to indicate why, leaving technical 
issues aside, they ought to be true. A reader familiar with or uninterested in 
this background may want to skip to the recipe in Figure 1.4, with a warning 
that the overly simplified summary given there is at best a starting point for 
applying Monte Carlo to pricing. 

The first of the principles above is the foundation of an industry. Financial 
intermediaries can sell options to their clients and then eliminate the risk from 
the resulting short position in the option through trading in other assets. 
They need to charge what it costs to implement the trading strategy, and 
competition ensures that they cannot charge (much) more. Their clients could 
in principle run the replicating trading strategy themselves instead of buying 
options, but financial institutions are better equipped to do this and can do 
it at lower cost. This role should be contrasted with that of the insurance 
industry. Insurers bear risk; derivative dealers transfer it. 

The second principle is the main link between pricing and Monte Carlo. 
The first principle gives us a way of thinking about what the price of a deriv- 
ative security ought to be, but it says little about how this price might be 
evaluated — it leaves us with the task of finding a hedging strategy and then 
determining the cost of implementing this strategy. But the second principle 
gives us a powerful shortcut because it tells us how to represent prices as ex- 
pectations. Expectations (and, more generally, integrals) lend themselves to 
evaluation through Monte Carlo and other numerical methods. The subtlety 
in this approach lies in the fact that we must describe the dynamics of asset 
prices not as we observe them but as they would be under a risk-adjusted 
probability measure. 

The third principle may be viewed as describing conditions under which 
the price of a derivative security is determined by the prices of other assets so 
that the first and second principles apply. A complete market is one in which 
all risks can be perfectly hedged. If all uncertainty in a market is generated 
by independent Brownian motions, then completeness roughly corresponds to 
the requirement that the number of traded assets be at least as large as the 
number of driving Brownian motions. Jumps in asset prices will often render a 
model incomplete because it may be impossible to hedge the effect of discon- 
tinuous movements. In an incomplete market, prices can still be represented 
as expectations in substantial generality, but the risk adjustment necessary 
for this representation may not be uniquely determined. In this setting, we 
need more economic information — an understanding of investor attitudes 
towards risk — to determine prices, so the machinery of derivatives pricing 
becomes less useful. 

A derivative security introduced into a complete market is a redundant 
asset. It does not expand investment opportunities; rather, it packages the 
trading strategy (from the first principle above) investors could have used 
anyway to synthesize the security. In this setting, pricing a derivative (using 
the second principle) may be viewed as a complex form of interpolation: we 
use a model to determine the price of the derivative relative to the prices of
other assets. On this point, mathematical theory and industry practice are 
remarkably well aligned. For a financial institution to create a new derivative 
security, it must determine how it will hedge (or synthesize) the security by 
trading in other, more liquid assets, and it must determine the cost of this 
trading strategy from the prices of these other assets. 


1.2.1 Pricing and Replication 


To further develop these ideas, we consider an economy with d assets whose 
prices $S_i(t)$, $i = 1, \ldots, d$, are described by a system of SDEs

$$\frac{dS_i(t)}{S_i(t)} = \mu_i(S(t), t)\, dt + \sigma_i(S(t), t)^\top dW^o(t), \qquad (1.23)$$

with $W^o$ a $k$-dimensional Brownian motion, each $\sigma_i$ taking values in $\mathbb{R}^k$, and
each $\mu_i$ scalar-valued. We assume that the $\mu_i$ and $\sigma_i$ are deterministic functions
of the current state $S(t) = (S_1(t), \ldots, S_d(t))^\top$ and time $t$, though the
general theory allows these coefficients to depend on past prices as well. (See
Appendix B for a brief review of stochastic differential equations and references
for further background.) Let

$$\Sigma_{ij} = \sigma_i^\top \sigma_j, \quad i, j = 1, \ldots, d; \qquad (1.24)$$

this may be interpreted as the covariance between the instantaneous returns
on assets $i$ and $j$.
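
As a small illustration of (1.24), the matrix $\Sigma$ can be assembled from the volatility vectors by taking inner products. The loadings below are hypothetical values for $d = 2$ assets driven by $k = 2$ Brownian motions, and the function name is an assumption.

```python
def instantaneous_covariance(sigmas):
    """Return the matrix with entries Sigma_ij = sigma_i . sigma_j (inner products of rows)."""
    d = len(sigmas)
    return [[sum(a * b for a, b in zip(sigmas[i], sigmas[j])) for j in range(d)]
            for i in range(d)]

# Hypothetical loadings of d = 2 assets on k = 2 Brownian motions.
sigmas = [[0.2, 0.0],
          [0.1, 0.3]]
cov = instantaneous_covariance(sigmas)
print(cov)
```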

A portfolio is characterized by a vector $\theta \in \mathbb{R}^d$ with $\theta_i$ representing the
number of units held of the $i$th asset. Since each unit of the $i$th asset is worth
$S_i(t)$ at time $t$, the value of the portfolio at time $t$ is

$$\theta_1 S_1(t) + \cdots + \theta_d S_d(t),$$

which we may write as $\theta^\top S(t)$. A trading strategy is characterized by a stochastic
process $\theta(t)$ of portfolio vectors. To be consistent with the intuitive
notion of a trading strategy, we need to restrict $\theta(t)$ to depend only on information
available at $t$; this is made precise through a measurability condition
(for example, that $\theta$ be predictable).

If we fix the portfolio holdings at $\theta(t)$ over the interval $[t, t + h]$, then the
change in value over this interval of the holdings in the $i$th asset is given by
$\theta_i(t)[S_i(t + h) - S_i(t)]$; the change in the value of the portfolio is given by
$\theta(t)^\top [S(t + h) - S(t)]$. This suggests that in the continuous-time limit we may
describe the gains from trading over $[0, t]$ through the stochastic integral

$$\int_0^t \theta(u)^\top dS(u),$$

subject to regularity conditions on $S$ and $\theta$. Notice that we allow trading of arbitrarily
large or small, positive or negative quantities of the underlying assets
continuously in time; this is a convenient idealization that ignores constraints
on real trading.
A trading strategy is self-financing if it satisfies

$$\theta(t)^\top S(t) - \theta(0)^\top S(0) = \int_0^t \theta(u)^\top dS(u) \qquad (1.25)$$

for all $t$. The left side of this equation is the change in portfolio value from time
0 to time $t$ and the right side gives the gains from trading over this interval.
Thus, the self-financing condition states that changes in portfolio value equal
gains from trading: no gains are withdrawn from the portfolio and no funds
are added. By rewriting (1.25) as

$$\theta(t)^\top S(t) = \theta(0)^\top S(0) + \int_0^t \theta(u)^\top dS(u),$$

we can interpret it as stating that from an initial investment of
$V(0) = \theta(0)^\top S(0)$ we can achieve a portfolio value of $V(t) = \theta(t)^\top S(t)$ by following
the strategy $\theta$ over $[0, t]$.
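
A discrete-time analogue makes the self-financing condition tangible: if every rebalancing leaves the portfolio value unchanged at current prices, then the change in value over the whole period equals the accumulated trading gains. The following sketch checks this for two assets over two periods. It is only a discrete approximation to (1.25), and the prices, holdings, and function names are hypothetical.

```python
def portfolio_value(theta, prices):
    """theta' S: value of holdings theta at the given prices."""
    return sum(q * p for q, p in zip(theta, prices))

def trading_gains(thetas, price_path):
    """Sum over periods of theta(t_k)' [S(t_{k+1}) - S(t_k)]."""
    gains = 0.0
    for k, theta in enumerate(thetas):
        gains += sum(q * (p1 - p0)
                     for q, p0, p1 in zip(theta, price_path[k], price_path[k + 1]))
    return gains

def rebalance(theta_old, prices, weight_first):
    """New two-asset holdings with the same value at current prices (no cash in or out)."""
    value = portfolio_value(theta_old, prices)
    return [weight_first * value / prices[0], (1.0 - weight_first) * value / prices[1]]

# Hypothetical prices of two assets at three dates.
price_path = [[100.0, 1.00], [110.0, 1.01], [99.0, 1.02]]
theta0 = [0.5, 50.0]
theta1 = rebalance(theta0, price_path[1], weight_first=0.3)

change_in_value = portfolio_value(theta1, price_path[2]) - portfolio_value(theta0, price_path[0])
gains = trading_gains([theta0, theta1], price_path)
print(change_in_value, gains)
```

If the holdings at the second date were instead chosen without preserving the portfolio value, the two printed numbers would differ by the amount of cash added or withdrawn.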

Consider, now, a derivative security with a payoff of f(S(T)) at time T; this 
could be a standard European call or put on one of the d assets, for example, 
but the payoff could also depend on several of the underlying assets. Suppose 
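that the value of this derivative at time $t$, $0 \le t \le T$, is given by some function
$V(S(t), t)$. The fact that the dynamics in (1.23) depend only on $(S(t), t)$ makes
it at least plausible that the same might be true of the derivative price. If we
further conjecture that $V$ is a sufficiently smooth function of its arguments,
Itô's formula (see Appendix B) gives

$$\begin{aligned}
V(S(t), t) = V(S(0), 0) &+ \sum_{i=1}^{d} \int_0^t \frac{\partial V(S(u), u)}{\partial S_i}\, dS_i(u) \\
&+ \int_0^t \left[ \frac{\partial V(S(u), u)}{\partial u} + \frac{1}{2} \sum_{i,j=1}^{d} S_i(u) S_j(u) \Sigma_{ij}(S(u), u) \frac{\partial^2 V(S(u), u)}{\partial S_i \partial S_j} \right] du, \qquad (1.26)
\end{aligned}$$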


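with $\Sigma$ as in (1.24). If the value $V(S(t), t)$ can be achieved from an initial
wealth of $V(S(0), 0)$ through a self-financing trading strategy $\theta$, then we also
have

$$V(S(t), t) = V(S(0), 0) + \sum_{i=1}^{d} \int_0^t \theta_i(u)\, dS_i(u). \qquad (1.27)$$

Comparing terms in (1.26) and (1.27), we find that both equations hold if

$$\theta_i(u) = \frac{\partial V(S(u), u)}{\partial S_i}, \quad i = 1, \ldots, d, \qquad (1.28)$$

and

$$\frac{\partial V(S, u)}{\partial u} + \frac{1}{2} \sum_{i,j=1}^{d} \Sigma_{ij}(S, u) S_i S_j \frac{\partial^2 V(S, u)}{\partial S_i \partial S_j} = 0. \qquad (1.29)$$

Since we also have $V(S(t), t) = \theta^\top(t) S(t)$, (1.28) implies

$$V(S, t) = \sum_{i=1}^{d} \frac{\partial V(S, t)}{\partial S_i} S_i. \qquad (1.30)$$

Finally, at $t = T$ we must have

$$V(S, T) = f(S) \qquad (1.31)$$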


if V is indeed to represent the value of the derivative security. 

Equations (1.29) and (1.30), derived here following the approach in Hunt 
and Kennedy [191], describe V through a partial differential equation (PDE) 
with boundary condition (1.31). Suppose we could find a solution V(S,t). In 
what sense would we be justified in calling this the price of the derivative 
security? 

By construction, V satisfies (1.29) and (1.30), and then (1.26) implies 
that the (assumed) self-financing representation (1.27) indeed holds with the 
trading strategy defined by (1.28). Thus, we may sell the derivative security for 
V(S(0), 0) at time 0, use the proceeds to implement this self-financing trading 
strategy, and deliver the promised payoff of $f(S(T)) = V(S(T), T)$ at time
$T$ with no risk. If anyone were willing to pay more than $V(S(0), 0)$, we could
sell the derivative and be guaranteed a riskless profit from a net investment of
zero; if anyone were willing to sell the derivative for less than $V(S(0), 0)$, we
could buy it, implement the strategy $-\theta(t)$, and again be ensured a riskless
profit without investment. Thus, V(S(0),0) is the only price that rules out 
riskless profits from zero net investment. 

From (1.30) we see that the trading strategy that replicates V holds 
$\partial V(S, t)/\partial S_i$ shares of the $i$th underlying asset at time $t$. This partial derivative
is the delta of $V$ with respect to $S_i$ and the trading strategy is called
delta hedging. 
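
The replication argument can be explored numerically with a discretely rebalanced delta hedge. The sketch below is an illustration under added assumptions: a single stock following geometric Brownian motion with real-world drift `mu`, a riskless account growing at rate `r`, and a European call whose value and delta are given by the standard Black-Scholes formula. Starting from the option premium, it rebalances to the Black-Scholes delta at equally spaced dates and records the gap between the hedge portfolio and the payoff. All parameter values, the seed, and the function names are hypothetical.

```python
import math
import random

def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def bs_d1(s, sigma, tau, r, k):
    return (math.log(s / k) + (r + 0.5 * sigma ** 2) * tau) / (sigma * math.sqrt(tau))

def bs_call(s, sigma, tau, r, k):
    """Black-Scholes call value with time to expiry tau > 0."""
    d1 = bs_d1(s, sigma, tau, r, k)
    d2 = d1 - sigma * math.sqrt(tau)
    return s * norm_cdf(d1) - k * math.exp(-r * tau) * norm_cdf(d2)

def bs_delta(s, sigma, tau, r, k):
    """Partial derivative of the call value with respect to the stock price."""
    return norm_cdf(bs_d1(s, sigma, tau, r, k))

def hedging_error(steps, rng, s0=100.0, k=100.0, r=0.05, sigma=0.2, mu=0.15, horizon=1.0):
    """Terminal value of a discretely rebalanced delta hedge minus the call payoff."""
    h = horizon / steps
    s = s0
    shares = bs_delta(s, sigma, horizon, r, k)
    cash = bs_call(s, sigma, horizon, r, k) - shares * s     # start with the option premium
    for j in range(1, steps + 1):
        z = rng.gauss(0.0, 1.0)
        s *= math.exp((mu - 0.5 * sigma ** 2) * h + sigma * math.sqrt(h) * z)  # drift mu, not r
        cash *= math.exp(r * h)
        if j < steps:
            new_shares = bs_delta(s, sigma, horizon - j * h, r, k)
            cash -= (new_shares - shares) * s                # self-financing rebalance
            shares = new_shares
    return shares * s + cash - max(s - k, 0.0)

def rms_error(steps, paths, rng):
    """Root mean square hedging error over independent paths."""
    return math.sqrt(sum(hedging_error(steps, rng) ** 2 for _ in range(paths)) / paths)

rng = random.Random(1)
coarse = rms_error(steps=5, paths=400, rng=rng)
fine = rms_error(steps=250, paths=400, rng=rng)
print(coarse, fine)
```

The hedging error should shrink as rebalancing becomes more frequent, even though the stock is simulated with a drift different from $r$; this is consistent with the observation that the drift does not enter the equations characterizing $V$.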

Inspection of (1.29) and (1.30) reveals that the drift parameters $\mu_i$ in the
asset price dynamics (1.23) do not appear anywhere in the partial differential
equation characterizing the derivative price $V$. This feature is sometimes
paraphrased through the statement that the price of a derivative does not
depend on the drifts of the underlying assets; it would be more accurate to
say that the effect of the drifts on the price of a derivative is already reflected
in the underlying asset prices $S_i$ themselves, because $V$ depends on the $S_i$
and the $S_i$ are clearly affected by the $\mu_i$.

The drifts of the underlying asset prices reflect investor attitudes toward 
risk. In a world of risk-averse investors, we may expect riskier assets to grow at 
a higher rate of return, so larger values of $\sigma_{ij}$ should be associated with larger
values of $\mu_i$. In a world of risk-neutral investors, all assets should grow at the
same rate; investors will not demand higher returns for riskier assets. The
fact that the $\mu_i$ do not appear in the equations for the derivative price $V$ may
therefore be interpreted as indicating that we can price the derivative without
needing to know anything about investor attitudes toward risk. This relies
critically on the existence of a self-financing trading strategy that replicates 
V: because we have assumed that V can be replicated by trading in the 
underlying assets, risk preferences are irrelevant; the price of the derivative is 
simply the minimal initial investment required to implement the replicating
strategy.


Black-Scholes Model 


As an illustration of the general formulation in (1.29) and (1.30), we consider 
the pricing of European options in the Black-Scholes model. The model con- 
tains two assets. The first (often interpreted as a stock price) is risky and its 
dynamics are represented through the scalar SDE 

$$\frac{dS(t)}{S(t)} = \mu\, dt + \sigma\, dW^o(t)$$

with $W^o$ a one-dimensional Brownian motion. The second asset (often called
a savings account or a money market account) is riskless and grows deterministically
at a constant, continuously compounded rate $r$; its dynamics are
given by

$$\frac{d\beta(t)}{\beta(t)} = r\, dt.$$

Clearly, $\beta(t) = \beta(0) e^{rt}$ and we may assume the normalization $\beta(0) = 1$. We
are interested in pricing a derivative security with a payoff of $f(S(T))$ at time
$T$. For example, a standard call option pays $(S(T) - K)^+$, with $K$ a constant.

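If we were to formulate this model in the notation of (1.23), $\Sigma$ would
be a 2 x 2 matrix with only one nonzero entry, $\sigma^2$. Making the appropriate
substitutions, (1.29) thus becomes

$$\frac{\partial V}{\partial t} + \frac{1}{2}\sigma^2 S^2 \frac{\partial^2 V}{\partial S^2} = 0. \tag{1.32}$$

Equation (1.30) becomes

$$V(S,\beta,t) = \frac{\partial V}{\partial S}\,S + \frac{\partial V}{\partial \beta}\,\beta. \tag{1.33}$$

These equations and the boundary condition $V(S,\beta,T) = f(S)$ determine the
price V.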

This formulation describes the price V as a function of the three variables
S, $\beta$, and t. Because $\beta$ depends deterministically on t, we are interested in
values of V only at points $(S,\beta,t)$ with $\beta = e^{rt}$. This allows us to eliminate
one variable and write the price as $\tilde V(S,t) = V(S,e^{rt},t)$, as in Hunt and
Kennedy [191]. Making this substitution in (1.32) and (1.33), noting that

$$\frac{\partial \tilde V}{\partial t} = r\beta\,\frac{\partial V}{\partial \beta} + \frac{\partial V}{\partial t}$$

and simplifying yields

$$\frac{\partial \tilde V}{\partial t} + rS\,\frac{\partial \tilde V}{\partial S} + \frac{1}{2}\sigma^2 S^2 \frac{\partial^2 \tilde V}{\partial S^2} - r\tilde V = 0.$$

This is the Black-Scholes PDE characterizing the price of a European
derivative security. For the special case of the boundary condition
$\tilde V(S,T) = (S - K)^+$, the solution is given by $\tilde V(S,t) = BS(S,\sigma,T-t,r,K)$, the
Black-Scholes formula in (1.4).
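The claim that the Black-Scholes formula solves this PDE can be checked numerically. The following is a minimal sketch, assuming that BS(S, σ, T, r, K) in (1.4) has the standard form S Φ(d) − e^{−rT} K Φ(d − σ√T) with d = [log(S/K) + (r + σ²/2)T]/(σ√T). The function names `norm_cdf`, `bs_call`, and `pde_residual`, and the finite-difference step sizes, are illustrative choices and do not come from the text.

```python
import math


def norm_cdf(x: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def bs_call(S: float, sigma: float, tau: float, r: float, K: float) -> float:
    """Standard Black-Scholes call price with time to expiration tau."""
    if tau <= 0.0:
        return max(S - K, 0.0)  # boundary condition at expiration
    vol = sigma * math.sqrt(tau)
    d = (math.log(S / K) + (r + 0.5 * sigma ** 2) * tau) / vol
    return S * norm_cdf(d) - math.exp(-r * tau) * K * norm_cdf(d - vol)


def pde_residual(S, t, T, sigma, r, K, hS=0.1, ht=1e-3):
    """Central-difference estimate of V_t + r S V_S + 0.5 sigma^2 S^2 V_SS - r V
    for V(S, t) = bs_call(S, sigma, T - t, r, K)."""
    def V(s, u):
        return bs_call(s, sigma, T - u, r, K)

    V_t = (V(S, t + ht) - V(S, t - ht)) / (2.0 * ht)
    V_S = (V(S + hS, t) - V(S - hS, t)) / (2.0 * hS)
    V_SS = (V(S + hS, t) - 2.0 * V(S, t) + V(S - hS, t)) / hS ** 2
    return V_t + r * S * V_S + 0.5 * sigma ** 2 * S ** 2 * V_SS - r * V(S, t)
```

Evaluating `pde_residual` at an interior point such as S = 100, t = 0.5, T = 1 should give a value close to zero, up to finite-difference error.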


1.2.2 Arbitrage and Risk-Neutral Pricing 


The previous section outlined an argument showing how the existence of a 
self-financing trading strategy that replicates a derivative security determines 
the price of the derivative security. Under assumptions on the dynamics of 
the underlying assets, this argument leads to a partial differential equation 
characterizing the price of the derivative. 

Several features may, however, limit the feasibility of calculating derivative 
prices by solving PDEs. If the asset price dynamics are sufficiently complex, a 
PDE characterizing the derivative price may be difficult to solve or may even 
fail to exist. If the payoff of a derivative security depends on the paths of the 
underlying assets and not simply their terminal values, the assumption that 
the price can be represented as a function V(S,t) generally fails to hold. If 
the number of underlying assets required by the replicating strategy is large 
(greater than two or three), numerical solution of the PDE may be impractical. 
These are precisely the settings in which Monte Carlo simulation is likely to 
be most useful. However, to apply Monte Carlo we must first find a more 
convenient representation of derivative prices. In particular, we would like 
to represent derivative prices as expectations of random objects that we can 
simulate. This section develops such representations. 


Arbitrage and Stochastic Discount Factors 


We return to the general setting described by the asset price dynamics in
(1.23), for emphasis writing $P_o$ for the probability measure under which these
dynamics are specified. (In particular, the process $W^o$ in (1.23) is a standard
Brownian motion under $P_o$.) The measure $P_o$ is intended to describe objective
(“real-world”) probabilities and the system of SDEs in (1.23) thus describes
the empirical dynamics of asset prices.

Recall the definition of a self-financing trading strategy $\theta(t)$ as given in
(1.25). A self-financing trading strategy $\theta(t)$ is called an arbitrage if either of
the following conditions holds for some fixed time t:

(i) $\theta(0)^\top S(0) < 0$ and $P_o(\theta(t)^\top S(t) \ge 0) = 1$;
(ii) $\theta(0)^\top S(0) = 0$, $P_o(\theta(t)^\top S(t) \ge 0) = 1$, and $P_o(\theta(t)^\top S(t) > 0) > 0$.

In (i), $\theta$ turns a negative initial investment into nonnegative final wealth with
probability 1. In (ii), $\theta$ turns an initial net investment of 0 into nonnegative
final wealth that is positive with positive probability. Each of these corresponds
to an opportunity to create something from nothing and is incompatible with
economic equilibrium. Precluding arbitrage is a basic consistency requirement
on the dynamics of the underlying assets in (1.23) and on the prices of any
derivative securities that can be synthesized from these assets through
self-financing trading strategies.

Call a process V(t) an attainable price process if $V(t) = \theta(t)^\top S(t)$ for
some self-financing trading strategy $\theta$. Thus, a European derivative security
can be replicated by trading in the underlying assets precisely if its payoff at
expiration T coincides with the value V(T) of some attainable price process
at time T. Each of the underlying asset prices $S_i(t)$ in (1.23) is attainable
through the trivial strategy that sets $\theta_i \equiv 1$ and $\theta_j \equiv 0$ for all $j \ne i$.

We now introduce an object whose role may at first seem mysterious but
which is central to asset pricing theory. Call a strictly positive process Z(t) a
stochastic discount factor (or a deflator) if the ratio V(t)/Z(t) is a martingale
for every attainable price process V(t); i.e., if

$$\frac{V(t)}{Z(t)} = E_o\left[\frac{V(T)}{Z(T)} \,\Big|\, \mathcal{F}_t\right] \tag{1.34}$$

whenever t < T. Here, $E_o$ denotes expectation under $P_o$ and $\mathcal{F}_t$ represents
the history of the Brownian motion $W^o$ up to time t. We require that Z(t) be
adapted to $\mathcal{F}_t$, meaning that the value of Z(t) is determined by the history of
the Brownian motion up to time t. Rewriting (1.34) as

$$V(t) = E_o\left[V(T)\,\frac{Z(t)}{Z(T)} \,\Big|\, \mathcal{F}_t\right] \tag{1.35}$$

explains the term “stochastic discount factor”: the price V(t) is the expected
discounted value of the price V(T) if we discount using Z(t)/Z(T). (It is
more customary to refer to 1/Z(t) rather than Z(t) as the stochastic discount
factor, deflator, or pricing kernel; our use of the terminology is nonstandard
but leads to greater symmetry when we discuss numeraire assets.) Notice
that any constant multiple of a stochastic discount factor is itself a stochastic
discount factor so we may adopt the normalization Z(0) = 1. Equation (1.35)
then specializes to

$$V(0) = E_o\left[\frac{V(T)}{Z(T)}\right]. \tag{1.36}$$

Suppose, for example, that V(t) represents the price at time t of a call
option on the ith underlying asset with strike price K and expiration T. Then
$V(T) = (S_i(T) - K)^+$; in particular, V is a known function of $S_i$ at time
T. Equation (1.36) states that the terminal value V(T) determines the initial
value V(0) through stochastic discounting.

We may think of (1.36) as reflecting two ways in which the price V(0) 
differs from the expected payoff $E_o[V(T)]$. The first results from “the time
value of money”: the payoff V(T) will not be received until T, and other 
things being equal we assume investors prefer payoffs received sooner rather 
than later. The second results from attitudes toward risk. In a world of risk- 
averse investors, risky payoffs should be more heavily discounted in valuing 
a security; this could not be accomplished through a deterministic discount 
factor. 

Most importantly for our purposes, the existence of a stochastic discount
factor rules out arbitrage. If $\theta$ is a self-financing trading strategy, then the
process $\theta(t)^\top S(t)$ is an attainable price process and the ratio $\theta(t)^\top S(t)/Z(t)$
must be a martingale. In particular, then,

$$\theta(0)^\top S(0) = E_o\left[\frac{\theta(T)^\top S(T)}{Z(T)}\right],$$

as in (1.36). Compare this with conditions (i) and (ii) above for an arbitrage,
recalling that Z is strictly positive. If $\theta(T)^\top S(T)$ is almost surely nonnegative, it is
impossible for $\theta(0)^\top S(0)$ to be negative; if $\theta(T)^\top S(T)$ is positive with positive
probability and almost surely nonnegative, then $\theta(0)^\top S(0) = 0$ is impossible.
Thus, there can be no arbitrage if the attainable price processes admit a
stochastic discount factor.

It is less obvious that the converse also holds: under a variety of technical 
conditions on asset price dynamics and trading strategies, it has been shown 
that the absence of arbitrage implies the existence of a stochastic discount 
factor (or the closely related concept of an equivalent martingale measure). 
We return to this point in Section 1.2.4. The equivalence of no-arbitrage to 
the existence of a stochastic discount factor is often termed the Fundamental 
Theorem of Asset Pricing, though it is not a single theorem but rather a 
body of results that apply under various sets of conditions. An essential early 
reference is Harrison and Kreps [170]; for further background and results, see 
Duffie [98] and Musiela and Rutkowski [275]. 


Risk-Neutral Pricing 


Let us suppose that among the d assets described in (1.23) there is one that is
risk-free in the sense that its coefficients $\sigma_{ij}$ are identically zero. Let us further
assume that its drift, which may be interpreted as a riskless interest rate, is a
constant r. As in our discussion of the Black-Scholes model in Section 1.2.1,
we denote this asset by $\beta(t)$ and refer to it as the money market account. Its
dynamics are given by the equation $d\beta(t)/\beta(t) = r\,dt$, with solution
$\beta(t) = \beta(0)\exp(rt)$; we fix $\beta(0)$ at 1.

Clearly, $\beta(t)$ is an attainable price process because it corresponds to the
trading strategy that makes an initial investment of 1 in the money market
account and continuously reinvests all gains in this single asset. Accordingly,
if the market admits a stochastic discount factor Z(t), the process $\beta(t)/Z(t)$
is a martingale. This martingale is positive because both $\beta(t)$ and Z(t) are
positive, and it has an initial value of $\beta(0)/Z(0) = 1$.

Any positive martingale with an initial value 1 defines a change of
probability measure. For each fixed interval [0, T], the process $\beta(t)/Z(t)$ defines a
new measure $P_\beta$ through the Radon-Nikodym derivative (or likelihood ratio
process)

$$\left(\frac{dP_\beta}{dP_o}\right)_t = \frac{\beta(t)}{Z(t)}, \quad 0 \le t \le T. \tag{1.37}$$

More explicitly, this means (cf. Appendix B.4) that for any event $A \in \mathcal{F}_t$,

$$P_\beta(A) = E_o\left[\mathbf{1}_A \left(\frac{dP_\beta}{dP_o}\right)_t\right] = E_o\left[\mathbf{1}_A\,\frac{\beta(t)}{Z(t)}\right],$$

where $\mathbf{1}_A$ denotes the indicator of the event A. Similarly, expectation under
the new measure is defined by

$$E_\beta[X] = E_o\left[X \left(\frac{dP_\beta}{dP_o}\right)_t\right] = E_o\left[X\,\frac{\beta(t)}{Z(t)}\right] \tag{1.38}$$

for any nonnegative X measurable with respect to $\mathcal{F}_t$. The measure $P_\beta$ is
called the risk-neutral measure; it is equivalent to $P_o$ in the sense of measures,
meaning that $P_\beta(A) = 0$ if and only if $P_o(A) = 0$. (Equivalent probability
measures agree about which events are impossible.) The risk-neutral measure
is a particular choice of equivalent martingale measure.

Consider, again, the pricing equation (1.36). In light of (1.38), we may
rewrite it as

$$V(0) = E_\beta\left[\frac{V(T)}{\beta(T)}\right] = e^{-rT} E_\beta[V(T)]. \tag{1.39}$$

This simple transformation is the cornerstone of derivative pricing by Monte
Carlo simulation. Equation (1.39) expresses the current price V(0) as the
expected present value of the terminal value V(T) discounted at the risk-free
rate r rather than through the stochastic discount factor Z. The expectation in
(1.39) is taken with respect to $P_\beta$ rather than $P_o$, so estimating the expectation
by Monte Carlo entails simulating under $P_\beta$ rather than $P_o$. These points are
crucial to the applicability of Monte Carlo because

• the dynamics of Z(t) are generally unknown and difficult to model (since
  they embody time and risk preferences of investors);

• the dynamics of the underlying asset prices are more easily described under
  the risk-neutral measure than under the objective probability measure.

The second point requires further explanation. Equation (1.39) generalizes
to

$$V(t) = E_\beta\left[\frac{\beta(t)}{\beta(T)}\,V(T) \,\Big|\, \mathcal{F}_t\right], \quad t < T, \tag{1.40}$$

with V(t) an attainable price process. In particular, then, since each $S_i(t)$ is
an attainable price process, each ratio $S_i(t)/\beta(t)$ is a martingale under $P_\beta$.
Specifying asset price dynamics under the risk-neutral measure thus entails
specifying dynamics that make the ratios $S_i(t)/\beta(t)$ martingales. If the
dynamics of the asset prices in (1.23) could be expressed as

$$\frac{dS_i(t)}{S_i(t)} = r\,dt + \sigma_i(S(t),t)^\top dW(t), \tag{1.41}$$

with W a standard k-dimensional Brownian motion under $P_\beta$, then

$$d\left(\frac{S_i(t)}{\beta(t)}\right) = \left(\frac{S_i(t)}{\beta(t)}\right)\sigma_i(S(t),t)^\top dW(t),$$

so $S_i(t)/\beta(t)$ would indeed be a martingale under $P_\beta$. Specifying a model of the
form (1.41) is simpler than specifying the original equation (1.23) because all
drifts in (1.41) are set equal to the risk-free rate r: the potentially complicated
drifts in (1.23) are irrelevant to the asset price dynamics under the risk-neutral
measure. Indeed, this explains the name “risk-neutral.” In a world of
risk-neutral investors, the rate of return on risky assets would be the same as the
risk-free rate.
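The passage from stochastic discounting in (1.36) to risk-free discounting in (1.39), and the martingale property of $S_i(t)/\beta(t)$, can be checked by hand in a discrete analogue. The sketch below is only an illustration: it replaces the continuous-time model with a one-period market having two states, and every number in it (probabilities, prices, the value of $\beta(T)$) is an assumption chosen for convenience. The deflator is built from the relation $\beta(T)/Z(T) = dP_\beta/dP_o$ of (1.37).

```python
# One-period, two-state market; all numbers are illustrative assumptions.
p_up, p_down = 0.6, 0.4            # objective probabilities (measure P_o)
S0, S_up, S_down = 100.0, 120.0, 90.0
beta_T = 1.05                      # money market account at T, beta(0) = 1

# Risk-neutral probabilities: chosen so that S / beta is a martingale.
q_up = (S0 * beta_T - S_down) / (S_up - S_down)
q_down = 1.0 - q_up

# Deflator in the convention of the text: beta(T) / Z(T) = dP_beta / dP_o = q / p.
Z_up = beta_T * p_up / q_up
Z_down = beta_T * p_down / q_down


def price_with_deflator(v_up: float, v_down: float) -> float:
    """V(0) = E_o[V(T) / Z(T)], the analogue of (1.36)."""
    return p_up * v_up / Z_up + p_down * v_down / Z_down


def price_risk_neutral(v_up: float, v_down: float) -> float:
    """V(0) = E_beta[V(T)] / beta(T), the analogue of (1.39)."""
    return (q_up * v_up + q_down * v_down) / beta_T


# A call struck at 100 pays 20 in the up state and 0 in the down state.
call_sdf = price_with_deflator(20.0, 0.0)
call_rn = price_risk_neutral(20.0, 0.0)
```

Both pricing functions return the same value for any payoff, and both recover the initial prices of the stock and the money market account. The objective probabilities enter only through the deflator and cancel in the price.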
Comparison of (1.41) and (1.23) indicates that the two are consistent if

$$dW(t) = dW^o(t) + \nu(t)\,dt$$

for some $\nu$ satisfying

$$\mu_i = r + \sigma_i^\top \nu, \quad i = 1,\dots,d, \tag{1.42}$$

because making this substitution in (1.41) yields

$$\begin{aligned}
\frac{dS_i(t)}{S_i(t)} &= r\,dt + \sigma_i(S(t),t)^\top\,[dW^o(t) + \nu(t)\,dt] \\
&= \left(r + \sigma_i(S(t),t)^\top \nu(t)\right) dt + \sigma_i(S(t),t)^\top dW^o(t) \\
&= \mu_i(S(t),t)\,dt + \sigma_i(S(t),t)^\top dW^o(t),
\end{aligned}$$

as in (1.23). The condition in (1.42) states that the objective and risk-neutral
measures are related through a change of drift in the driving Brownian motion.
It follows from the Girsanov Theorem (see Appendix B) that any measure
equivalent to $P_o$ must be related to $P_o$ in this way. In particular, the diffusion
terms $\sigma_{ij}$ in (1.41) and (1.23) must be the same. This is important because it
ensures that the coefficients required to describe the dynamics of asset prices
under the risk-neutral measure $P_\beta$ can be estimated from data observed under
the real-world measure $P_o$.

We now briefly summarize the pricing of derivative securities through
the risk-neutral measure with Monte Carlo simulation. Consider a derivative
security with a payoff at time T specified through a function f of the
prices of the underlying assets, as in the case of a standard call or put. To
price the derivative, we model the dynamics of the underlying assets under
the risk-neutral measure, ensuring that discounted asset prices are martingales,
typically through choice of the drift. The price of the derivative is then
given by $E_\beta[e^{-rT} f(S(T))]$. To evaluate this expectation, we simulate paths
of the underlying assets over the time interval [0, T], simulating according to
their risk-neutral dynamics. On each path we calculate the discounted payoff
$e^{-rT} f(S(T))$; the average across paths is our estimate of the derivative’s
price. Figure 1.4 gives a succinct statement of these steps, but it should be
clear that especially the first step in the figure is an oversimplification.


Monte Carlo Recipe for Cookbook Pricing

• replace drifts $\mu_i$ in (1.23) with risk-free interest rate and simulate paths;
• calculate payoff of derivative security on each path;
• discount payoffs at the risk-free rate;
• calculate average over paths.

Fig. 1.4. An overly simplified summary of risk-neutral pricing by Monte Carlo.
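The four steps of Figure 1.4 can be sketched in code for a single underlying asset. This is a minimal illustration under stated assumptions, not an implementation taken from the text: the name `mc_price`, the Euler time-stepping scheme, and the callable `sigma_fn(S, t)` standing in for the volatility coefficient are all choices made here. Euler stepping introduces a discretization bias that the figure does not address, which is one sense in which the recipe is oversimplified.

```python
import math
import random


def mc_price(payoff, S0, r, sigma_fn, T, n_steps, n_paths, seed=0):
    """Estimate E_beta[exp(-r T) f(S(T))] for one asset by Euler simulation.

    payoff   -- function f applied to the terminal price S(T)
    sigma_fn -- hypothetical volatility function sigma(S, t)
    """
    rng = random.Random(seed)
    dt = T / n_steps
    sqrt_dt = math.sqrt(dt)
    discount = math.exp(-r * T)
    total = 0.0
    for _ in range(n_paths):
        S, t = S0, 0.0
        for _ in range(n_steps):
            # Step 1: the drift is the risk-free rate r, not the real-world drift.
            S += S * (r * dt + sigma_fn(S, t) * sqrt_dt * rng.gauss(0.0, 1.0))
            t += dt
        # Steps 2 and 3: payoff on this path, discounted at the risk-free rate.
        total += discount * payoff(S)
    # Step 4: average over paths.
    return total / n_paths
```

A simple sanity check is to price the payoff f(S) = S: because the discounted asset price is a martingale under the risk-neutral measure, the estimate should be close to S(0), up to sampling and discretization error.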


Black-Scholes Model 


To illustrate these ideas, consider the pricing of a call option on a stock.
Suppose the real-world dynamics of the stock are given by

$$\frac{dS(t)}{S(t)} = \mu(S(t),t)\,dt + \sigma\,dW^o(t),$$

with $W^o$ a standard one-dimensional Brownian motion under $P_o$ and $\sigma$ a
constant. Each unit invested in the money market account at time 0 grows to
a value of $\beta(t) = e^{rt}$ at time t. Under the risk-neutral measure $P_\beta$, the stock
price dynamics are given by

$$\frac{dS(t)}{S(t)} = r\,dt + \sigma\,dW(t)$$

with W a standard Brownian motion under $P_\beta$. This implies that

$$S(T) = S(0)\exp\left(\left(r - \tfrac{1}{2}\sigma^2\right)T + \sigma W(T)\right).$$

If the call option has strike K and expiration T, its price at time 0 is given by
$E_\beta[e^{-rT}(S(T) - K)^+]$. Because W(T) is normally distributed, this expectation
can be evaluated explicitly and results in the Black-Scholes formula (1.4).
In particular, pricing through the risk-neutral measure produces the same
result as pricing through the PDE formulation in Section 1.2.1, as it must
since in both cases the price is determined by the absence of arbitrage. This
also explains why we are justified in equating the expected discounted payoff
calculated in Section 1.1.2 with the price of the option.
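Because S(T) has an explicit expression in terms of the normal random variable W(T), the expectation can be estimated by sampling S(T) directly, with no time-stepping. The sketch below compares such an estimate with the closed-form value. It assumes the standard form of the Black-Scholes formula for (1.4); the function names and the reporting of a standard error are illustrative choices.

```python
import math
import random


def norm_cdf(x: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def bs_call(S0: float, sigma: float, T: float, r: float, K: float) -> float:
    """Standard Black-Scholes call price (assumed to match (1.4))."""
    vol = sigma * math.sqrt(T)
    d = (math.log(S0 / K) + (r + 0.5 * sigma ** 2) * T) / vol
    return S0 * norm_cdf(d) - math.exp(-r * T) * K * norm_cdf(d - vol)


def mc_call(S0, sigma, T, r, K, n_paths, seed=0):
    """Return (estimate, standard error) of E_beta[exp(-rT) (S(T) - K)^+]."""
    rng = random.Random(seed)
    drift = (r - 0.5 * sigma ** 2) * T
    vol = sigma * math.sqrt(T)
    discount = math.exp(-r * T)
    total = 0.0
    total_sq = 0.0
    for _ in range(n_paths):
        # Exact draw of S(T) under the risk-neutral measure.
        s_T = S0 * math.exp(drift + vol * rng.gauss(0.0, 1.0))
        y = discount * max(s_T - K, 0.0)
        total += y
        total_sq += y * y
    mean = total / n_paths
    variance = (total_sq - n_paths * mean * mean) / (n_paths - 1)
    return mean, math.sqrt(variance / n_paths)
```

With enough paths the estimate should fall within a few standard errors of `bs_call`; the standard error shrinks in proportion to the inverse square root of the number of paths.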


Dividends 


Thus far, we have implicitly assumed that the underlying assets $S_i$ do not pay
dividends. This is implicit, for example, in our discussion of the self-financing
trading strategies. In the definition (1.25) of a self-financing strategy $\theta$, we
interpret $\theta_i(u)\,dS_i(u)$ as the trading gains from the ith asset over the time
increment du. This, however, reflects only the capital gains resulting from
the change in price in the ith asset. If each share pays dividends at rate
$dD_i(u)$ over du, then the portfolio gains would also include terms of the form
$\theta_i(u)\,dD_i(u)$.

In the presence of dividends, a simple strategy of holding a single share 
of a single asset is no longer self-financing, because it entails withdrawal of 
the dividends from the portfolio. In contrast, a strategy that continuously 
reinvests all dividends from an asset back into that asset is self-financing in 
the sense that it involves neither the withdrawal nor addition of funds from the 
portfolio. When dividends are reinvested, the number of shares held changes 
over time. 

These observations suggest that we may accommodate dividends by
redefining the original assets to include the reinvested dividends. Let $\tilde S_i(t)$ be
the ith asset price process with dividends reinvested, defined through the
requirement

$$\frac{d\tilde S_i(t)}{\tilde S_i(t)} = \frac{dS_i(t) + dD_i(t)}{S_i(t)}. \tag{1.43}$$

The expression on the right is the instantaneous return on the ith original
asset, including both capital gains and dividends; the expression on the left
is the instantaneous return on the ith new asset in which all dividends are
reinvested. For $\tilde S_i$ to support this interpretation, the two sides must be equal.

The new assets $\tilde S_i$ pay no dividends so we may apply the ideas developed
above in the absence of dividends to these assets. In particular, we may
reinterpret the asset price dynamics in (1.23) as applying to the $\tilde S_i$ rather than
to the original $S_i$. One consequence of this is that the $\tilde S_i$ will have continuous
paths, so any discontinuities in the cumulative dividend process $D_i$ must be
offset by the original asset price $S_i$. For example, a discrete dividend
corresponds to a positive jump in $D_i$ and this must be accompanied by an offsetting
negative jump in $S_i$.

For purposes of derivative pricing, the most important point is that the
martingale property under the risk-neutral measure applies to $\tilde S_i(t)/\beta(t)$
rather than $S_i(t)/\beta(t)$. This affects how we model the dynamics of the $S_i$
under $P_\beta$. Consider, for example, an asset paying a continuous dividend yield
at rate $\delta$, meaning that $dD_i(t) = \delta S_i(t)\,dt$. For $e^{-rt}\tilde S_i(t)$ to be a martingale,
we require that the dt coefficient in $d\tilde S_i(t)/\tilde S_i(t)$ be r. Equating dt terms on
the two sides of (1.43), we conclude that the coefficient on dt in the equation
for $dS_i(t)/S_i(t)$ must be $r - \delta$. Thus, in modeling asset prices under the
risk-neutral measure, the effect of a continuous dividend yield is to change the
drift from r to $r - \delta$. The first step in Figure 1.4 is modified accordingly.
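As a specific illustration, consider a version of the Black-Scholes model in
which the underlying asset has dividend yield $\delta$. The risk-neutral dynamics of
the asset are given by

$$\frac{dS(t)}{S(t)} = (r - \delta)\,dt + \sigma\,dW(t)$$

with solution

$$S(t) = S(0)\exp\left(\left(r - \delta - \tfrac{1}{2}\sigma^2\right)t + \sigma W(t)\right).$$

The price of a call option with strike K and expiration T is given by the
expectation $E_\beta[e^{-rT}(S(T) - K)^+]$, which evaluates to

$$e^{-\delta T} S(0)\,\Phi(d) - e^{-rT} K\,\Phi(d - \sigma\sqrt{T}), \qquad d = \frac{\log(S(0)/K) + \left(r - \delta + \tfrac{1}{2}\sigma^2\right)T}{\sigma\sqrt{T}}, \tag{1.44}$$

with $\Phi$ the cumulative normal distribution.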
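Formula (1.44) translates directly into code. The sketch below is illustrative only; the function names are assumptions made here, and the normal distribution function is computed from the error function in the standard library.

```python
import math


def norm_cdf(x: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def call_with_dividend_yield(S0, K, r, delta, sigma, T):
    """Call price (1.44) for an asset paying a continuous dividend yield delta."""
    vol = sigma * math.sqrt(T)
    d = (math.log(S0 / K) + (r - delta + 0.5 * sigma ** 2) * T) / vol
    return (math.exp(-delta * T) * S0 * norm_cdf(d)
            - math.exp(-r * T) * K * norm_cdf(d - vol))
```

One way to read (1.44) is that a dividend yield $\delta$ has the same effect on the call price as replacing the initial price S(0) with $e^{-\delta T}S(0)$ in the formula without dividends; setting `delta` to zero recovers the ordinary Black-Scholes price.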


1.2.3 Change of Numeraire 


The risk-neutral pricing formulas (1.39) and (1.40) continue to apply if the
constant risk-free rate r is replaced with a time-varying rate r(t), in which
case the money market account becomes

$$\beta(t) = \exp\left(\int_0^t r(u)\,du\right)$$

and the pricing formula becomes

$$V(t) = E_\beta\left[\exp\left(-\int_t^T r(u)\,du\right) V(T) \,\Big|\, \mathcal{F}_t\right].$$

The risk-neutral dynamics of the asset prices now take the form

$$\frac{dS_i(t)}{S_i(t)} = r(t)\,dt + \sigma_i(S(t),t)^\top dW(t),$$

with W a standard k-dimensional Brownian motion under $P_\beta$. Subject only
to technical conditions, these formulas remain valid if the short rate r(t) is a
stochastic process.

Indeed, our choice of $\beta(t)$ as the asset through which to define a new
probability measure in (1.38) was somewhat arbitrary. This choice resulted in
pricing formulas with the appealing feature that they discount payoffs at the
risk-free rate; it also resulted in a simple interpretation of the measure $P_\beta$ as
risk-neutral in the sense that all assets grow at the risk-free rate under this 
measure. Nevertheless, we could just as well have chosen a different asset as 
numeraire, meaning the asset relative to which all others are valued. As we 
explain next, all choices of numeraire result in analogous pricing formulas and 
the flexibility to change the numeraire is a useful modeling and computational 
tool. 

Although we could start from the objective measure $P_o$ as we did in
Section 1.2.2, it may be simpler to start from the risk-neutral measure $P_\beta$,
especially if we assume a constant risk-free rate r. Choosing asset $S_d$ as numeraire
means defining a new probability measure $P_{S_d}$ through the likelihood ratio
process (Radon-Nikodym derivative)

$$\left(\frac{dP_{S_d}}{dP_\beta}\right)_t = \frac{S_d(t)}{\beta(t)} \Big/ \frac{S_d(0)}{\beta(0)}.$$

Recall that $S_d(t)/\beta(t)$ is a positive martingale under $P_\beta$; dividing it by its
initial value produces a unit-mean positive martingale and thus defines a change
of measure. Expectation under $P_{S_d}$ is given by

$$E_{S_d}[X] = E_\beta\left[X \left(\frac{dP_{S_d}}{dP_\beta}\right)_t\right] = E_\beta\left[X\,\frac{S_d(t)\,\beta(0)}{\beta(t)\,S_d(0)}\right]$$

for nonnegative $X \in \mathcal{F}_t$. The pricing formula (1.39) thus implies (recalling
that $\beta(0) = 1$)

$$V(0) = E_\beta\left[\frac{V(T)}{\beta(T)}\right] = S_d(0)\,E_{S_d}\left[\frac{V(T)}{S_d(T)}\right]. \tag{1.45}$$

Equation (1.40) similarly implies

$$V(t) = S_d(t)\,E_{S_d}\left[\frac{V(T)}{S_d(T)} \,\Big|\, \mathcal{F}_t\right]. \tag{1.46}$$

Thus, to price under $P_{S_d}$, we discount the terminal value V(T) by dividing
by the terminal value of the numeraire and multiplying by the current value
of the numeraire.
Some examples should help illustrate the potential utility of this
transformation. Consider, first, an option to exchange one asset for another, with
payoff $(S_1(T) - S_2(T))^+$ at time T. The price of the option is given by

$$e^{-rT} E_\beta[(S_1(T) - S_2(T))^+]$$

but also by

$$S_2(0)\,E_{S_2}\left[\frac{(S_1(T) - S_2(T))^+}{S_2(T)}\right] = S_2(0)\,E_{S_2}\left[\left(\frac{S_1(T)}{S_2(T)} - 1\right)^+\right].$$

The expression on the right looks like the price of a standard call option on 
the ratio of the two assets with a strike of 1; it reveals that the price of the 
exchange option is sensitive to the dynamics of the ratio but not otherwise to 
the dynamics of the individual assets. In particular, if the ratio has a constant 
volatility (a feature invariant under equivalent changes of measure), then the 
option can be valued through a variant of the Black-Scholes formula due to 
Margrabe [247]. 
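As an illustration, the sketch below evaluates the exchange option in closed form and checks it against a Monte Carlo estimate computed under the risk-neutral measure. The text does not state the formula, so its form here is an assumption: two assets with constant volatilities `sigma1` and `sigma2` and constant correlation `rho` between their driving Brownian motions, so that the ratio $S_1/S_2$ has constant volatility $\sqrt{\sigma_1^2 + \sigma_2^2 - 2\rho\sigma_1\sigma_2}$ (assumed positive). The function names are hypothetical.

```python
import math
import random


def norm_cdf(x: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def margrabe(S1, S2, sigma1, sigma2, rho, T):
    """Exchange option price: S2(0) times a call on S1/S2 struck at 1, zero rate."""
    sigma = math.sqrt(sigma1 ** 2 + sigma2 ** 2 - 2.0 * rho * sigma1 * sigma2)
    vol = sigma * math.sqrt(T)  # volatility of the ratio over [0, T]
    d = (math.log(S1 / S2) + 0.5 * sigma ** 2 * T) / vol
    return S1 * norm_cdf(d) - S2 * norm_cdf(d - vol)


def mc_exchange(S1, S2, sigma1, sigma2, rho, r, T, n_paths, seed=0):
    """Estimate exp(-rT) E_beta[(S1(T) - S2(T))^+] with both drifts equal to r."""
    rng = random.Random(seed)
    discount = math.exp(-r * T)
    sqrt_T = math.sqrt(T)
    total = 0.0
    for _ in range(n_paths):
        z1 = rng.gauss(0.0, 1.0)
        # Second normal draw with correlation rho to the first.
        z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        s1 = S1 * math.exp((r - 0.5 * sigma1 ** 2) * T + sigma1 * sqrt_T * z1)
        s2 = S2 * math.exp((r - 0.5 * sigma2 ** 2) * T + sigma2 * sqrt_T * z2)
        total += discount * max(s1 - s2, 0.0)
    return total / n_paths
```

The closed form does not involve the interest rate r, consistent with the observation that the price depends only on the dynamics of the ratio; the Monte Carlo estimate, which does use r, should agree with it up to sampling error.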

Consider, next, a call option on a foreign stock whose payoff will be
converted to the domestic currency at the exchange rate prevailing at the
expiration date T. Letting $S_1$ denote the stock price in the foreign currency and
letting $S_2$ denote the exchange rate (expressed as number of domestic units per
foreign unit), the payoff (in domestic currency) becomes $S_2(T)(S_1(T) - K)^+$
with price

$$e^{-rT} E_\beta[S_2(T)(S_1(T) - K)^+].$$

Each unit of foreign currency earns interest at a risk-free rate $r_f$ and this acts
like a continuous dividend yield. Choosing $\tilde S_2(t) = e^{r_f t} S_2(t)$ as numeraire, we
may express the price as

$$e^{-r_f T} S_2(0)\,E_{\tilde S_2}[(S_1(T) - K)^+],$$

noting that $\tilde S_2(0) = S_2(0)$. This expression involves the current exchange rate
$S_2(0)$ but not the unknown future rate $S_2(T)$.

The flexibility to change numeraire can be particularly valuable in a model 
with stochastic interest rates, so our last example applies to this setting. 
Consider an interest rate derivative with a payoff of V(T) at time T. Using
the risk-neutral measure, we can express its price as

$$V(0) = E_\beta\left[\exp\left(-\int_0^T r(u)\,du\right) V(T)\right].$$

The forward measure for maturity $T_F$ is the measure associated with taking
as numeraire a zero-coupon bond maturing at $T_F$ with a face value of 1. We
denote the time-t value of the bond by $B(t,T_F)$ (so $B(T_F,T_F) = 1$) and the
associated measure by $P_{T_F}$. Using this measure, we can write the price as

$$V(0) = B(0,T_F)\,E_{T_F}\left[\frac{V(T)}{B(T,T_F)}\right].$$

With the specific choice $T_F = T$, we get

$$V(0) = B(0,T)\,E_T[V(T)].$$

Observe that in this expression the discount factor (the initial bond price)
is deterministic even though the interest rate r(t) may be stochastic. This
feature often leads to useful simplifications in pricing interest rate derivatives.

To use any of the price representations above derived through a change
of numeraire, we need to know the dynamics of the underlying asset prices
under the corresponding probability measure. For example, if in (1.45) the
terminal value V(T) is a function of the values $S_i(T)$ of the underlying assets,
then to estimate the rightmost expectation through Monte Carlo we need to
be able to simulate paths of the underlying assets according to their dynamics
under $P_{S_d}$. We encountered the same issue in Section 1.2.2 in pricing under
the risk-neutral measure $P_\beta$. There we noted that changing from the objective
measure $P_o$ to the risk-neutral measure had the effect of changing the drifts
of all prices to the risk-free rate; an analogous change of drift applies more
generally in changing numeraire.

Based on the dynamics in (1.41), we may write the asset price $S_d(t)$ as

$$S_d(t) = S_d(0)\exp\left(\int_0^t \left[r(u) - \tfrac{1}{2}\|\sigma_d(u)\|^2\right] du + \int_0^t \sigma_d(u)^\top dW(u)\right), \tag{1.47}$$

with W a standard Brownian motion under $P_\beta$. Here, we have implicitly
generalized the setting in (1.41) to allow the short rate to be time-varying
and even stochastic; we have also abbreviated $\sigma_d(S(u),u)$ as $\sigma_d(u)$ to lighten
notation. From this and the definition of $P_{S_d}$, we therefore have

$$\left(\frac{dP_{S_d}}{dP_\beta}\right)_t = \exp\left(-\tfrac{1}{2}\int_0^t \|\sigma_d(u)\|^2\,du + \int_0^t \sigma_d(u)^\top dW(u)\right).$$

Through the Girsanov Theorem (see Appendix B), we find that changing
measure from $P_\beta$ to $P_{S_d}$ has the effect of adding a drift to W. More precisely,
the process $W^d$ defined by

$$dW^d(t) = -\sigma_d(t)\,dt + dW(t) \tag{1.48}$$

is a standard Brownian motion under $P_{S_d}$. Making this substitution in (1.41),
we find that

$$\begin{aligned}
\frac{dS_i(t)}{S_i(t)} &= r(t)\,dt + \sigma_i(t)^\top dW(t) \\
&= r(t)\,dt + \sigma_i(t)^\top\,[dW^d(t) + \sigma_d(t)\,dt] \\
&= [r(t) + \sigma_i(t)^\top \sigma_d(t)]\,dt + \sigma_i(t)^\top dW^d(t) \\
&= [r(t) + \Sigma_{id}(t)]\,dt + \sigma_i(t)^\top dW^d(t)
\end{aligned} \tag{1.49}$$

with $\Sigma_{id}(t) = \sigma_i(t)^\top \sigma_d(t)$. Thus, when we change measures from $P_\beta$ to $P_{S_d}$,
an additional term appears in the drift of $S_i$ reflecting the instantaneous
covariance between $S_i$ and the numeraire asset $S_d$.

The distinguishing feature of this change of measure is that it makes the
ratios $S_i(t)/S_d(t)$ martingales. This is already implicit in (1.46) because each
$S_i(t)$ is an attainable price process and thus a candidate for V(t). To make
the martingale property more explicit, we may use (1.47) for $S_i$ and $S_d$ and
then simplify using (1.48) to write the ratio as

$$\frac{S_i(t)}{S_d(t)} = \frac{S_i(0)}{S_d(0)}\exp\left(-\tfrac{1}{2}\int_0^t \|\sigma_i(u) - \sigma_d(u)\|^2\,du + \int_0^t [\sigma_i(u) - \sigma_d(u)]^\top dW^d(u)\right).$$

This reveals that $S_i(t)/S_d(t)$ is an exponential martingale (see (B.21) in
Appendix B) under $P_{S_d}$ because $W^d$ is a standard Brownian motion under that
measure. This also provides a convenient way of thinking about asset price
dynamics under the measure $P_{S_d}$: under this measure, the drifts of the asset
prices make the ratios $S_i(t)/S_d(t)$ martingales.
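A single-asset example may make the change of drift concrete. The sketch below prices a call in the Black-Scholes model by taking the stock itself as numeraire: by (1.49) with i = d and constant volatility, the stock has drift $r + \sigma^2$ under $P_S$, and by (1.45) the price is $S(0)\,E_S[(1 - K/S(T))^+]$. The function names are illustrative assumptions, and the closed-form benchmark is the standard Black-Scholes formula, assumed to match (1.4).

```python
import math
import random


def norm_cdf(x: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def bs_call(S0, sigma, T, r, K):
    """Standard Black-Scholes call price, used here as the benchmark."""
    vol = sigma * math.sqrt(T)
    d = (math.log(S0 / K) + (r + 0.5 * sigma ** 2) * T) / vol
    return S0 * norm_cdf(d) - math.exp(-r * T) * K * norm_cdf(d - vol)


def call_via_stock_numeraire(S0, K, r, sigma, T, n_paths, seed=0):
    """Estimate S(0) * E_S[(S(T) - K)^+ / S(T)] by simulating under P_S."""
    rng = random.Random(seed)
    sqrt_T = math.sqrt(T)
    total = 0.0
    for _ in range(n_paths):
        w = sqrt_T * rng.gauss(0.0, 1.0)  # draw of W^d(T) under P_S
        # Drift r + sigma^2 gives log-drift r + sigma^2 - sigma^2/2 = r + sigma^2/2.
        s_T = S0 * math.exp((r + 0.5 * sigma ** 2) * T + sigma * w)
        # Terminal payoff divided by the terminal value of the numeraire.
        total += max(1.0 - K / s_T, 0.0)
    # Multiply by the current value of the numeraire.
    return S0 * total / n_paths
```

No explicit discount factor appears: discounting is carried entirely by the numeraire. The estimate should agree with `bs_call` up to sampling error.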


1.2.4 The Market Price of Risk 


In this section we conclude our overview of the principles underlying
derivatives pricing by returning to the idea of a stochastic discount factor
introduced in Section 1.2.2 and further developing its connections with the absence
of arbitrage, market completeness, and dynamic hedging. Though not strictly
necessary for the application of Monte Carlo (which is based on the pricing
relations (1.39) and (1.45)), these ideas are important parts of the underlying
theory.

We proceed by considering the dynamics of a stochastic discount factor
Z(t) as defined in Section 1.2.2. Just as the likelihood ratio process $(dP_\beta/dP_o)_t$
defined in (1.37) is a positive martingale under $P_o$, its reciprocal $(dP_o/dP_\beta)_t$
is a positive martingale under $P_\beta$; this is a general change of measure identity
and is not specific to this context. From (1.37) we find that $(dP_o/dP_\beta)_t =
Z(t)/\beta(t)$ and thus that $e^{-rt}Z(t)$ is a positive martingale under $P_\beta$. (For
simplicity, we assume the short rate r is constant.) This suggests that Z(t)
should evolve according to an SDE of the form

$$\frac{dZ(t)}{Z(t)} = r\,dt + \nu(t)^\top dW(t), \tag{1.50}$$

for some process $\nu$, with W continuing to be a standard Brownian motion
under $P_\beta$. Indeed, under appropriate conditions, the martingale representation
theorem (Appendix B) ensures that the dynamics of Z must have this form.

Equation (1.50) imposes a restriction on the dynamics of the underlying
assets $S_i$ under the objective probability measure $P_o$. The dynamics of the $S_i$
under the risk-neutral measure are given in (1.41). Switching from $P_\beta$ back
to $P_o$ is formally equivalent to applying a change of numeraire from $\beta(t)$ to
Z(t). The process Z(t) may not correspond to an asset price, but this has no
effect on the mechanics of the change of measure.

We saw in the previous section that switching from $P_\beta$ to $P_{S_d}$ had the
effect of adding a drift to W; more precisely, the process $W^d$ defined in (1.48)
becomes a standard Brownian motion under $P_{S_d}$. We saw in (1.49) that this
has the effect of adding a term to the drifts of the asset prices as viewed under
$P_{S_d}$. By following exactly the same steps, we recognize that the likelihood ratio

$$\left(\frac{dP_o}{dP_\beta}\right)_t = e^{-rt}Z(t) = \exp\left(-\tfrac{1}{2}\int_0^t \|\nu(u)\|^2\,du + \int_0^t \nu(u)^\top dW(u)\right)$$

implies (through the Girsanov Theorem) that

$$dW^o(t) = -\nu(t)\,dt + dW(t)$$

defines a standard Brownian motion under $P_o$ and that the asset price
dynamics can be expressed as

$$\begin{aligned}
\frac{dS_i(t)}{S_i(t)} &= r\,dt + \sigma_i(t)^\top dW(t) \\
&= r\,dt + \sigma_i(t)^\top\,[dW^o(t) + \nu(t)\,dt] \\
&= [r + \nu(t)^\top \sigma_i(t)]\,dt + \sigma_i(t)^\top dW^o(t).
\end{aligned} \tag{1.51}$$


Comparing this with our original specification in (1.23), we find that the
existence of a stochastic discount factor implies that the drifts must have the
form

$$\mu_i(t) = r + \nu(t)^\top \sigma_i(t). \tag{1.52}$$

This representation suggests an interpretation of $\nu$ as a risk premium. The
components of $\nu$ determine the amount by which the drift of a risky asset will
exceed the risk-free rate r. In the case of a scalar $W^o$ and $\nu$, from the equation
$\mu_i = r + \nu\sigma_i$ we see that the excess return $\mu_i - r$ generated by a risky asset is
proportional to its volatility $\sigma_i$, with $\nu$ the constant of proportionality. In this
sense, $\nu$ is the market price of risk; it measures the excess return demanded by
investors per unit of risk. In the vector case, each component $\nu_j$ may similarly
be interpreted as the market price of risk associated with the jth risk factor,
that is, the jth component of $W^o$. It should also be clear that had we assumed
the drifts in (1.23) to have the form in (1.52) (for some $\nu$) from the outset,
we could have defined a stochastic discount factor Z from $\nu$ and (1.50). Thus,
the existence of a stochastic discount factor and a market price of risk vector
are essentially equivalent.
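The construction of Z from $\nu$ can be illustrated in the scalar Black-Scholes setting. The sketch below assumes constant $\mu$, $\sigma$, and r, so that $\nu = (\mu - r)/\sigma$ and, from (1.50) together with $W(T) = W^o(T) + \nu T$, $Z(T) = \exp(rT + \nu W^o(T) + \tfrac{1}{2}\nu^2 T)$. It simulates the stock under the objective measure, with drift $\mu$, and prices a call through $V(0) = E_o[V(T)/Z(T)]$. The function names are hypothetical, and the benchmark is the standard Black-Scholes formula, assumed to match (1.4).

```python
import math
import random


def norm_cdf(x: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def bs_call(S0, sigma, T, r, K):
    """Standard Black-Scholes call price, used here as the benchmark."""
    vol = sigma * math.sqrt(T)
    d = (math.log(S0 / K) + (r + 0.5 * sigma ** 2) * T) / vol
    return S0 * norm_cdf(d) - math.exp(-r * T) * K * norm_cdf(d - vol)


def price_by_stochastic_discounting(S0, K, mu, r, sigma, T, n_paths, seed=0):
    """Estimate E_o[(S(T) - K)^+ / Z(T)] by simulating under the objective measure."""
    rng = random.Random(seed)
    nu = (mu - r) / sigma  # scalar market price of risk
    sqrt_T = math.sqrt(T)
    total = 0.0
    for _ in range(n_paths):
        w_o = sqrt_T * rng.gauss(0.0, 1.0)  # draw of W^o(T) under P_o
        s_T = S0 * math.exp((mu - 0.5 * sigma ** 2) * T + sigma * w_o)
        z_T = math.exp(r * T + nu * w_o + 0.5 * nu ** 2 * T)  # deflator, Z(0) = 1
        total += max(s_T - K, 0.0) / z_T
    return total / n_paths
```

The estimate should agree with `bs_call` for any choice of `mu`: the real-world drift affects the simulated paths and the deflator in offsetting ways.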

An alternative line of argument (which we mention but do not develop) 
derives the market price of risk in a more fundamental way as the aggregate 
effect of the individual investment and consumption decisions of agents in an 
economy. Throughout this section, we have taken the dynamics of the asset 
prices to be specified exogenously. In a more general formulation, asset prices 
result from balancing supply and demand among agents who trade to optimize 
their lifetime investment and consumption; the market price of risk is then 
determined through the risk aversion of the agents as reflected in their utility 
for wealth and consumption. Thus, in a general equilibrium model of this type, 
the market price of risk emerges as a consequence of investor preferences and 
not just as a constraint to preclude arbitrage. For more on this approach, see
Chapter 10 of Duffie [98].


Incomplete Markets


The economic foundation of the market price of risk and the closely related 
concept of a stochastic discount factor is particularly important in an in- 
complete market. A complete market is one in which all risks that affect asset 
prices can be perfectly hedged. Any new asset (such as an option on one of the 
existing assets) introduced into a complete market is redundant in the sense 
that it can be replicated by trading in the other assets. Derivative prices are 
thus determined by the absence of arbitrage. In an incomplete market, some 
risks cannot be perfectly hedged and it is therefore possible to introduce gen- 
uinely new assets that cannot be replicated by trading in existing assets. In 
this case, the absence of arbitrage constrains the price of a derivative security 
but may not determine it uniquely. 

For example, market incompleteness may arise because there are fewer 
traded assets than driving Brownian motions. In this case, there may be infi- 
nitely many solutions to (1.52), and thus infinitely many choices of stochastic 
discount factor $Z(t)$ for which $S_i(t)/Z(t)$ will be martingales, $i = 1, \ldots, d$.
Similarly, there are infinitely many possible risk-neutral measures, meaning
measures equivalent to the original one under which $e^{-rt}S_i(t)$ are martingales.
As a consequence of these indeterminacies, the price of a new security
introduced into the market may not be uniquely determined by the prices of 
existing assets. The machinery of derivatives pricing is largely inapplicable in 
an incomplete market. 

Market incompleteness can arise in various ways; a few examples should 
serve to illustrate this. Some assets are not traded, making them inaccessible 
for hedging. How would one eliminate the risk from an option on a privately 
held business, a parcel of land, or a work of art? Some sources of risk may not 
correspond to asset prices at all — think of hedging a weather derivative with 
a payoff tied to rainfall or temperature. Jumps in asset prices and stochastic 
volatility can often render a market model incomplete by introducing risks 
that cannot be eliminated through trading in other assets. In such cases, 
pricing derivatives usually entails making some assumptions, sometimes only 
implicitly, about the market price for bearing unhedgeable risks. 


2 


Generating Random Numbers and Random 
Variables 


This chapter deals with algorithms at the core of Monte Carlo simulation: 
methods for generating uniformly distributed random variables and methods 
for transforming those variables to other distributions. These algorithms may 
be executed millions of times in the course of a simulation, making efficient 
implementation especially important. 

Uniform and nonuniform random variate generation have each spawned a 
vast research literature; we do not attempt a comprehensive account of either 
topic. The books by Bratley, Fox, and Schrage [59], Devroye [95], Fishman 
[121], Gentle [136], Niederreiter [281], and others provide more extensive cov- 
erage of these areas. We treat the case of the normal distribution in more 
detail than is customary in books on simulation because of its importance in 
financial engineering. 


2.1 Random Number Generation 


2.1.1 General Considerations 


At the core of nearly all Monte Carlo simulations is a sequence of apparently 
random numbers used to drive the simulation. In analyzing Monte Carlo methods,
we will treat this driving sequence as though it were genuinely random.
This is a convenient fiction that allows us to apply tools from probability and 
statistics to analyze Monte Carlo computations — convenient because modern 
pseudorandom number generators are sufficiently good at mimicking genuine 
randomness to make this analysis informative. Nevertheless, we should be 
aware that the apparently random numbers at the heart of a simulation are 
in fact produced by completely deterministic algorithms. 

The objectives of this section are to discuss some of the primary consid- 
erations in the design of random number generators, to present a few simple 
generators that are good enough for practical use, and to discuss their imple- 
mentation. We also provide references to a few more sophisticated (though 
not necessarily better) methods. Elegant theory has been applied to the prob-
lem of random number generation, but it is mostly unrelated to the tools we 
use elsewhere in the book (with the exception of Chapter 5), so we do not 
treat the topic in depth. The books of Bratley, Fox, and Schrage [59], Fishman 
[121], Gentle [136], Knuth [212], and Niederreiter [281], and the survey article 
of L’Ecuyer [223] provide detailed treatment and extensive references to the 
literature. 

Before discussing sequences that appear to be random but are not, we 
should specify what we mean by a generator of genuinely random numbers: 
we mean a mechanism for producing a sequence of random variables $U_1, U_2, \ldots$
with the property that

(i) each $U_i$ is uniformly distributed between 0 and 1;
(ii) the $U_i$ are mutually independent.


Property (i) is a convenient but arbitrary normalization; values uniformly 
distributed between 0 and 1/2 would be just as useful, as would values from 
nearly any other simple distribution. Uniform random variables on the unit 
interval can be transformed into samples from essentially any other distribution
using, for example, methods described in Sections 2.2 and 2.3. Property
(ii) is the more important one. It implies, in particular, that all pairs of values
should be uncorrelated and, more generally, that the value of $U_i$ should not
be predictable from $U_1, \ldots, U_{i-1}$.

A random number generator (often called a pseudorandom number generator
to emphasize that it only mimics randomness) produces a finite sequence
of numbers $u_1, u_2, \ldots, u_K$ in the unit interval. Typically, the values generated
depend in part on input parameters specified by the user. Any such sequence
constitutes a set of possible outcomes of independent uniforms $U_1, \ldots, U_K$.
A good random number generator is one that satisfies the admittedly vague
requirement that small (relative to $K$) segments of the sequence $u_1, \ldots, u_K$
should be difficult to distinguish from a realization of independent uniforms.

An effective generator therefore produces values that appear consistent 
with properties (i) and (ii) above. If the number of values K is large, the 
fraction of values falling in any subinterval of the unit interval should be 
approximately the length of the subinterval — this is uniformity. Independence 
suggests that there should be no discernible pattern among the values. To put 
this only slightly more precisely, statistical tests for independence should not 
easily reject segments of the sequence $u_1, \ldots, u_K$.
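As a rough illustration of what "appearing consistent" with properties (i) and (ii) can mean in practice, the sketch below checks the fraction of values in a subinterval and the sample correlation between consecutive values. It uses Python's standard `random` module as a stand-in source of uniforms; the seed, the sample size, the subinterval, and the function names are arbitrary choices made for this example, and two summary statistics are of course far from a serious test battery.

```python
import random

def fraction_in_interval(values, lo, hi):
    """Fraction of values falling in the half-open interval [lo, hi)."""
    return sum(lo <= v < hi for v in values) / len(values)

def lag1_correlation(values):
    """Sample correlation between consecutive values (u_i, u_{i+1})."""
    xs, ys = values[:-1], values[1:]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(12345)                  # arbitrary seed (assumption)
u = [rng.random() for _ in range(100_000)]  # K = 100,000 values

frac = fraction_in_interval(u, 0.2, 0.5)    # uniformity: expect about 0.3
rho = lag1_correlation(u)                   # independence: expect about 0
```

With a sample of this size, `frac` should differ from 0.3 only in the third decimal place and `rho` should be of order $1/\sqrt{K}$.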

We can make these and other considerations more concrete through ex- 
amples. A linear congruential generator is a recurrence of the following form: 


$$x_{i+1} = a x_i \bmod m \tag{2.1}$$

$$u_{i+1} = x_{i+1}/m \tag{2.2}$$


Here, the multiplier $a$ and the modulus $m$ are integer constants that determine
the values generated, given an initial value (seed) $x_0$. The seed is an integer
between 1 and $m - 1$ and is ordinarily specified by the user. The operation
$y \bmod m$ returns the remainder of $y$ (an integer) after division by $m$. In other
words,

$$y \bmod m = y - \lfloor y/m \rfloor m, \tag{2.3}$$

where $\lfloor x \rfloor$ denotes the greatest integer less than or equal to $x$. For example, 7
mod 5 is 2; 10 mod 5 is 0; 43 mod 5 is 3; and 3 mod 5 is 3. Because the result
of the mod $m$ operation is always an integer between 0 and $m - 1$, the output
values $u_i$ produced by (2.1)-(2.2) are always between 0 and $(m - 1)/m$; in
particular, they lie in the unit interval.
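The recurrence (2.1)-(2.2) translates almost line for line into code. The following is a minimal sketch, with the function name `lcg` and the toy parameters $a = 6$, $m = 11$ chosen only for illustration; Python integers do not overflow, so the product $a x_i$ is formed directly here.

```python
def lcg(a, m, seed):
    """Yield u_1, u_2, ... from x_{i+1} = a*x_i mod m, u_{i+1} = x_{i+1}/m."""
    x = seed
    while True:
        x = (a * x) % m      # (2.1)
        yield x / m          # (2.2)

def mod_via_floor(y, m):
    """y mod m computed as in (2.3), for nonnegative integers y."""
    return y - (y // m) * m

gen = lcg(a=6, m=11, seed=1)                  # toy parameters (assumption)
first_three = [next(gen) for _ in range(3)]   # 6/11, 3/11, 7/11
```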

Because of their simplicity and potential for effectiveness, linear congruen- 
tial generators are among the most widely used in practice. We discuss them 
in detail in Section 2.1.2. At this point, we use them to illustrate some gen- 
eral considerations in the design of random number generators. Notice that 
the linear congruential generator has the form 


$$x_{i+1} = f(x_i), \qquad u_{i+1} = g(x_{i+1}), \tag{2.4}$$


for some deterministic functions $f$ and $g$. If we allow the $x_i$ to be vectors,
then virtually all random number generators fit this general form.

Consider the sequence of $x_i$ produced in (2.1) by a linear congruential
generator with $a = 6$ and $m = 11$. (In practice, $m$ should be large; these
values are solely for illustration.) Starting from $x_0 = 1$, the next value is
6 mod 11 = 6, followed by $(6 \cdot 6)$ mod 11 = 3. The seed $x_0 = 1$ thus produces
the sequence

1, 6, 3, 7, 9, 10, 5, 8, 4, 2, 1, 6, ....


Once a value is repeated, the entire sequence repeats. Indeed, since a computer
can represent only a finite number of values, any recurrence of the form in
(2.4) will eventually return to a previous $x_i$ and then repeat all values that
followed that $x_i$. Observe that in this example all ten distinct integers between
1 and $m - 1$ appeared in the sequence before a value was repeated. (If we were
to start the sequence at 0, all subsequent values would be zero, so we do not
allow $x_0 = 0$.) If we keep $m = 11$ but take $a = 3$, the seed $x_0 = 1$ yields

1, 3, 9, 5, 4, 1, ...,

whereas $x_0 = 2$ yields

2, 6, 7, 10, 8, 2, ....

Thus, in this case, the possible values $\{1, 2, \ldots, 10\}$ split into two cycles. This
means that regardless of what $x_0$ is chosen, a multiplier of $a = 3$ produces just
five distinct numbers before it repeats, whereas a multiplier of $a = 6$ produces
all ten distinct values before repeating. A linear congruential generator that
produces all $m - 1$ distinct values before repeating is said to have full period.
In practice we would like to be able to generate (at least) tens of millions
of distinct values before repeating any. Simply choosing $m$ to be very large
does not ensure this property because of the possibility that a poor choice of
parameters $a$ and $m$ may result in short cycles among the values $\{1, 2, \ldots, m - 1\}$.
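The cycles in these small examples are easy to enumerate by machine. The sketch below (the helper name `lcg_cycle` is ours) follows the recurrence from a given seed until the seed recurs; it assumes $m$ is prime and $a$ is not a multiple of $m$, so that the seed is guaranteed to recur, and it caps the loop at $m$ steps as a safeguard.

```python
def lcg_cycle(a, m, seed):
    """Return the x-values visited from `seed` until the seed recurs."""
    cycle = [seed]
    x = (a * seed) % m
    while x != seed and len(cycle) < m:   # cap guards against bad parameters
        cycle.append(x)
        x = (a * x) % m
    return cycle

full = lcg_cycle(6, 11, 1)       # a = 6: one cycle through all ten values
short_1 = lcg_cycle(3, 11, 1)    # a = 3: cycle of length five
short_2 = lcg_cycle(3, 11, 2)    # a = 3: the other cycle of length five
```

The two short cycles are disjoint and together exhaust $\{1, \ldots, 10\}$, which is the splitting described above.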

With these examples in mind, we discuss the following general considera- 
tions in the construction of a random number generator: 


o Period length. As already noted, any random number generator of the form 
(2.4) will eventually repeat itself. Other things being equal, we prefer gen- 
erators with longer periods — i.e., generators that produce more distinct 
values before repeating. The longest possible period for a linear congruen- 
tial generator with modulus m is m — 1. For a linear congruential generator 
with full period, the gaps between the values $u_i$ produced are of width
1/m; hence, the larger m is the more closely the values can approximate a 
uniform distribution. 

o Reproducibility. One might be tempted to look to physical devices — a 
computer’s clock or a specially designed electronic mechanism — to generate 
true randomness. One drawback of a genuinely random sequence is that it 
cannot be reproduced easily. It is often important to be able to rerun a 
simulation using exactly the same inputs used previously, or to use the same 
inputs in two or more different simulations. This is easily accomplished with 
a linear congruential generator or any other procedure of the general form 
(2.4) simply by using the same seed $x_0$.

o Speed. Because a random number generator may be called thousands or even 
millions of times in a single simulation, it must be fast. It is hard to imagine 
an algorithm simpler or faster than the linear congruential generator; most 
of the more involved methods to be touched on in Section 2.1.5 remain fast 
in absolute terms, though they involve more operations per value generated. 
The early literature on random number generation includes strategies for 
saving computing time through convenient parameter choices. For example, 
by choosing m to be a power of 2, the mod m operation can be implemented 
by shifting bits, without explicit division. Given current computing speeds, 
this incremental speed-up does not seem to justify choosing a generator 
with poor distributional properties. 

o Portability. An algorithm for generating random numbers should produce 
the same sequence of values on all computing platforms. The quest for 
speed and long periods occasionally leads to implementations that depend 
on machine-specific representations of numbers. Some implementations of 
linear congruential generators rely on the way overflow is handled on par- 
ticular computers. We return to this issue in the next section. 

o Randomness. The most important consideration is the hardest to define or 
ensure. There are two broad aspects to constructing generators with appar- 
ent randomness: theoretical properties and statistical tests. Much is known 
about the structure of points produced by the most widely used generators 
and this helps narrow the search for good parameter values. Generators with 
good theoretical properties can then be subjected to statistical scrutiny to 
test for evident departures from randomness. Fortunately, the field is
sufficiently well developed that for most applications one can comfortably use
one of many generators in the literature that have survived rigorous tests
and the test of time.


2.1.2 Linear Congruential Generators 


The general linear congruential generator, first proposed by Lehmer [229], 
takes the form 


$$x_{i+1} = (a x_i + c) \bmod m$$

$$u_{i+1} = x_{i+1}/m$$


This is sometimes called a mixed linear congruential generator and the mul- 
tiplicative case in the previous section a pure linear congruential generator. 
Like a and m, the parameter c must be an integer. 

Quite a bit is known about the structure of the sets of values $\{u_1, \ldots, u_K\}$
produced by this type of algorithm. In particular, simple conditions are
available ensuring that the generator has full period, i.e., that the number of
distinct values generated from any seed $x_0$ is $m - 1$. If $c \neq 0$, the conditions
are (Knuth [212, p.17])


(a) $c$ and $m$ are relatively prime (their only common divisor is 1);
(b) every prime number that divides $m$ divides $a - 1$;
(c) $a - 1$ is divisible by 4 if $m$ is.


As a simple consequence, we observe that if $m$ is a power of 2, the generator
has full period if $c$ is odd and $a = 4n + 1$ for some integer $n$.
If $c = 0$ and $m$ is prime, full period is achieved from any $x_0 \neq 0$ if


o $a^{m-1} - 1$ is a multiple of $m$;
o $a^j - 1$ is not a multiple of $m$ for $j = 1, \ldots, m - 2$.


A number $a$ satisfying these two properties is called a primitive root of $m$.
Observe that when $c = 0$ the sequence $\{x_i\}$ becomes


$$x_0,\ a x_0,\ a^2 x_0,\ a^3 x_0,\ \ldots \pmod m.$$


The sequence first returns to $x_0$ at the smallest $k$ for which $a^k x_0 \bmod m = x_0$.
This is the smallest $k$ for which $a^k \bmod m = 1$; i.e., the smallest $k$ for which
$a^k - 1$ is a multiple of $m$. So, the definition of a primitive root corresponds
precisely to the requirement that the sequence not return to $x_0$ until $a^{m-1} x_0$.
It can also be verified that when $a$ is a primitive root of $m$, all $x_i$ are nonzero if
$x_0$ is nonzero. This is important because if some $x_i$ were 0, then all subsequent
values generated would be too.
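Checking the two conditions directly would require $m - 2$ modular powers. A standard shortcut, which we use in the sketch below but which is not part of the discussion above, is that for prime $m$ it suffices to check $a^{(m-1)/p} \bmod m \neq 1$ for each prime $p$ dividing $m - 1$. The helper names are ours, and the trial-division factorization is adequate only for moduli of the size considered in this section.

```python
def prime_factors(n):
    """Distinct prime factors of n by trial division."""
    factors = set()
    p = 2
    while p * p <= n:
        while n % p == 0:
            factors.add(p)
            n //= p
        p += 1
    if n > 1:
        factors.add(n)
    return factors

def is_primitive_root(a, m):
    """For prime m: True if a**k mod m first equals 1 at k = m - 1."""
    if a % m == 0:
        return False
    # The order of a divides m - 1; it is a proper divisor exactly when
    # a**((m-1)/p) mod m == 1 for some prime p dividing m - 1.
    return all(pow(a, (m - 1) // p, m) != 1 for p in prime_factors(m - 1))

ok_small = is_primitive_root(6, 11)             # full period in the toy example
bad_small = is_primitive_root(3, 11)            # period five only
ok_large = is_primitive_root(16807, 2**31 - 1)  # multiplier 16807, modulus 2^31 - 1
```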

Marsaglia [249] demonstrates that little additional generality is achieved
by taking $c \neq 0$. Since a generator with a nonzero $c$ is slower than one without,
it is now customary to take $c = 0$. In this case, it is convenient to take $m$ to
be prime, since it is then possible to construct full-period generators simply
by finding primitive roots of $m$.

Table 2.1 displays moduli and multipliers for seven linear congruential 
generators that have been recommended in the literature. In each case, the 
modulus $m$ is a large prime not exceeding $2^{31} - 1$. This is the largest integer
that can be represented in a 32-bit word (assuming one bit is used to
determine the sign) and it also happens to be a prime (a Mersenne prime).
Each multiplier a in the table is a primitive root of the corresponding mod- 
ulus, so all generators in the table have full period. The first generator listed 
was dubbed the “minimal standard” by Park and Miller [294]; though widely 
used, it appears to be inferior to the others listed. Among the remaining gen- 
erators, those identified by Fishman and Moore [123] appear to have slightly 
better uniformity while those from L’Ecuyer [222] offer a computational
advantage resulting from having comparatively smaller values of $a$ (in particular,
$a < \sqrt{m}$). We discuss this computational advantage and the basis on which
these generators have been compared next. 

Generators with far longer periods are discussed in Section 2.1.5. L’Ecuyer,
Simard, and Wegenkittl [228] reject all “small” generators like those in
Table 2.1 as obsolete. Section 2.1.5 explains how they remain useful as
components of combined generators.


Modulus m                  Multiplier a   Reference

2^31 - 1 (= 2147483647)    16807          Lewis, Goodman, and Miller [234],
                                          Park and Miller [294]
2^31 - 1                   39373          L’Ecuyer [222]
2^31 - 1                   742938285      Fishman and Moore [123]
2^31 - 1                   950706376      Fishman and Moore [123]
2^31 - 1                   1226874159     Fishman and Moore [123]
2147483399                 40692          L’Ecuyer [222]
2147483563                 40014          L’Ecuyer [222]

Table 2.1. Parameters for linear congruential generators. The generator in the first 
row appears to be inferior to the rest. 


2.1.3 Implementation of Linear Congruential Generators 


Besides speed, avoiding overflow is the main consideration in implementing a
linear congruential generator. If the product $a x_i$ can be represented exactly
for every $x_i$ in the sequence, then no overflow occurs. If, for example, every
integer from 0 to $a(m - 1)$ can be represented exactly in double precision,
then implementation in double precision is straightforward.

If the multiplier $a$ is large, as in three of the generators of Table 2.1, even
double precision may not suffice for an exact representation of every product
$a x_i$. In this case, the generator may be implemented by first representing the
multiplier as $a = 2^{\alpha} a_1 + a_2$, with $a_1, a_2 < 2^{\alpha}$, and then using


$$a x_i \bmod m = \left(a_1 (2^{\alpha} x_i \bmod m) + a_2 x_i \bmod m\right) \bmod m.$$


For example, with $\alpha = 16$ and $m = 2^{31} - 1$ this implementation never requires
an intermediate value as large as $2^{47}$, even though $a x_i$ could be close to $2^{62}$.
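The decomposition can be sketched as follows. Python integers never overflow, so the code cannot fail the way a double-precision implementation would; instead it records the largest intermediate value so that the $2^{47}$ bound can be checked. The function name and the choice of test multiplier (the third row of Table 2.1) are ours.

```python
def mult_mod_split(a, x, m, alpha=16):
    """Return (a*x mod m, largest intermediate) using a = 2**alpha*a1 + a2."""
    mask = (1 << alpha) - 1
    a1, a2 = a >> alpha, a & mask        # high and low parts of the multiplier
    p1 = (1 << alpha) * x                # 2**alpha * x_i
    p2 = a1 * (p1 % m)                   # a1 * (2**alpha * x_i mod m)
    p3 = a2 * x                          # a2 * x_i
    s = p2 + p3 % m                      # sum before the final reduction
    return s % m, max(p1, p2, p3, s)

m = 2**31 - 1
a = 742938285                            # a large multiplier from Table 2.1
value, biggest = mult_mod_split(a, m - 1, m)
```

Since every intermediate stays below $2^{47}$, and hence below $2^{53}$, each step would also be exact in double precision.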

Integer arithmetic is sometimes faster than floating point arithmetic, in
which case an implementation in integer variables is more appealing than
one using double precision. Moreover, if variables $y$ and $m$ are represented as
integers in a computer, the integer operation $y/m$ produces $\lfloor y/m \rfloor$, so $y \bmod m$
can be implemented as $y - (y/m) * m$ (see (2.3)). However, working in integer
variables restricts the magnitude of numbers that can be represented far more
than does working in double precision. To avoid overflow, a straightforward
implementation of a linear congruential generator in integer variables must
be restricted to an unacceptably small modulus, e.g., $2^{15} - 1$. If $a$ is not
too large (say $a < \sqrt{m}$, as in the first two and last two entries of Table 2.1),
Bratley, Fox, and Schrage [59] show that a faster implementation is possible
using only integer arithmetic, while still avoiding overflow.

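Their method is based on the following observations. Let


$$q = \lfloor m/a \rfloor, \qquad r = m \bmod a,$$


so that the modulus can be represented as $m = aq + r$. The calculation to be
carried out by the generator is


$$a x_i \bmod m = a x_i - \left\lfloor \frac{a x_i}{m} \right\rfloor m
  = \left( a x_i - \left\lfloor \frac{x_i}{q} \right\rfloor m \right)
  + \left( \left\lfloor \frac{x_i}{q} \right\rfloor - \left\lfloor \frac{a x_i}{m} \right\rfloor \right) m. \tag{2.5}$$


The first term on the right in (2.5) satisfies


$$a x_i - \left\lfloor \frac{x_i}{q} \right\rfloor m
  = a x_i - \left\lfloor \frac{x_i}{q} \right\rfloor (aq + r)
  = a \left( x_i - \left\lfloor \frac{x_i}{q} \right\rfloor q \right) - \left\lfloor \frac{x_i}{q} \right\rfloor r
  = a (x_i \bmod q) - \left\lfloor \frac{x_i}{q} \right\rfloor r.$$


Making this substitution in (2.5) yields


$$a x_i \bmod m = a (x_i \bmod q) - \left\lfloor \frac{x_i}{q} \right\rfloor r
  + \left( \left\lfloor \frac{x_i}{q} \right\rfloor - \left\lfloor \frac{a x_i}{m} \right\rfloor \right) m. \tag{2.6}$$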


To prevent overflow, we need to avoid calculation of the potentially large term
$a x_i$ on the right side of (2.6). In fact, we can entirely avoid calculation of


$$\left\lfloor \frac{x_i}{q} \right\rfloor - \left\lfloor \frac{a x_i}{m} \right\rfloor \tag{2.7}$$


if we can show that this expression takes only the values 0 and 1. For in this
case, the last term in (2.6) is either 0 or $m$, and since the final calculation
must result in a value in $\{0, 1, \ldots, m - 1\}$, the last term in (2.6) is $m$ precisely
when


$$a (x_i \bmod q) - \left\lfloor \frac{x_i}{q} \right\rfloor r < 0.$$


Thus, the last term in (2.6) adds $m$ to the first two terms precisely when not
doing so would result in a value outside of $\{0, 1, \ldots, m - 1\}$.

It remains to verify that (2.7) takes only the values 0 and 1. This holds if


$$\frac{x_i}{q} - \frac{a x_i}{m} \le 1. \tag{2.8}$$


But $x_i$ never exceeds $m - 1$, and


$$\frac{m - 1}{q} - \frac{a(m - 1)}{m} = \frac{r(m - 1)}{qm}.$$


Thus, (2.8) holds if $r < q$; a simple sufficient condition ensuring this is
$a < \sqrt{m}$.

The result of this argument is that (2.6) can be implemented so that every 
intermediate calculation results in an integer between —(m — 1) and m — 1, 
allowing calculation of $a x_i \bmod m$ without overflow. In particular, explicit
calculation of (2.7) is avoided by checking indirectly whether the result of this 
calculation would be 0 or 1. L’Ecuyer [222] gives a simple implementation of 
this idea, which we illustrate in Figure 2.1. 


(m, a  integer constants
 q, r  precomputed integer constants,
       with q = ⌊m/a⌋, r = m mod a
 x     integer variable holding the current x_i)

k ← ⌊x/q⌋
x ← a * (x − k * q) − k * r
if (x < 0) x ← x + m


Fig. 2.1. Implementation of ax mod m in integer arithmetic without overflow,
assuming r < q (e.g., a < √m).


The final step in using a congruential generator, converting the $x_i \in
\{0, 1, \ldots, m - 1\}$ to a value in the unit interval, is not displayed in Figure 2.1.
This can be implemented by setting u ← x * h where h is a precomputed
constant equal to 1/m.
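The three lines of Figure 2.1 and the conversion step can be sketched as follows, using the parameters in the first row of Table 2.1. The names `schrage_step` and `next_uniform` are ours. Because Python integers do not overflow, the point of the sketch is only to confirm that the integer recipe reproduces $a x \bmod m$; in a language with 32-bit integers the same lines keep every intermediate value between $-(m - 1)$ and $m - 1$.

```python
M = 2**31 - 1          # modulus m
A = 16807              # multiplier a
Q = M // A             # q = floor(m/a)
R = M % A              # r = m mod a
H = 1.0 / M            # precomputed constant h = 1/m

def schrage_step(x):
    """One step x <- a*x mod m, following the three lines of Figure 2.1."""
    k = x // Q
    x = A * (x - k * Q) - k * R
    if x < 0:
        x += M
    return x

def next_uniform(x):
    """Advance the state and return (new state, u) with u = x * h."""
    x = schrage_step(x)
    return x, x * H

state, u = next_uniform(12345)   # arbitrary seed (assumption)
```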
Of the generators in Table 2.1, the first two and the last two satisfy $a < \sqrt{m}$
and thus may be implemented using Figure 2.1. L’Ecuyer [222] finds that
the second, sixth, and seventh generators listed in the table have the best
distributional properties among all choices of multiplier $a$ that are primitive
roots of $m$ and satisfy $a < \sqrt{m}$, $m \le 2^{31} - 1$. Fishman [121] recommends
working in double precision in order to get the somewhat superior uniformity
of the large multipliers in Table 2.1. We will see in Section 2.1.5 that by
combining generators it is possible to maintain the computational advantage
of having $a < \sqrt{m}$ without sacrificing uniformity.


Skipping Ahead 


It is occasionally useful to be able to split a random number stream into
apparently unrelated subsequences. This can be implemented by initializing the
same random number generator to two or more distinct seeds. Choosing the seeds
arbitrarily leaves open the possibility that the ostensibly unrelated subsequences
will have substantial overlap. This can be avoided by choosing the seeds far 
apart along the sequence produced by a random number generator. 

With a linear congruential generator, it is easy to skip ahead along the
sequence without generating intermediate values. If $x_{i+1} = a x_i \bmod m$, then


$$x_{i+k} = a^k x_i \bmod m.$$

This in turn is equivalent to

$$x_{i+k} = \left((a^k \bmod m)\, x_i\right) \bmod m.$$


Thus, one could compute the constant $a^k \bmod m$ just once and then easily
produce a sequence of values spaced $k$ apart along the generator’s output. See
L’Ecuyer, Simard, Chen, and Kelton [227] for an implementation.
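A minimal sketch of skipping ahead, not the implementation cited above: Python's built-in three-argument `pow` computes $a^k \bmod m$ by repeated squaring, so the jump costs only on the order of $\log k$ multiplications. The function names, the seed, and the spacing of one million are illustrative assumptions.

```python
def lcg_step(x, a, m):
    """One ordinary step of the generator."""
    return (a * x) % m

def skip_ahead(x, k, a, m):
    """Jump from x_i directly to x_{i+k}."""
    a_k = pow(a, k, m)            # the constant a**k mod m
    return (a_k * x) % m

A, M = 16807, 2**31 - 1
x0 = 12345                        # arbitrary seed (assumption)

# Reference value obtained the slow way, one step at a time.
x_slow = x0
for _ in range(1000):
    x_slow = lcg_step(x_slow, A, M)

x_fast = skip_ahead(x0, 1000, A, M)

# Seeds for three substreams spaced one million values apart.
seeds = [skip_ahead(x0, j * 10**6, A, M) for j in range(3)]
```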

Splitting a random number stream carefully is essential if the subsequences 
are to be assigned to parallel processors running simulations intended to be 
independent of each other. Splitting a stream can also be useful when simu- 
lation is used to compare results from a model at different parameter values. 
In comparing results, it is generally preferable to use the same random num- 
bers for both sets of simulations, and to use them for the same purpose in 
both to the extent possible. For example, if the model involves simulating d 
asset prices, one would ordinarily want to arrange matters so that the random
numbers used to simulate the $i$th asset at one parameter value are used
to simulate the same asset at other parameter values. Dedicating a separate 
subsequence of the generator to each asset ensures this arrangement. 


2.1.4 Lattice Structure 


In discussing the generators of Table 2.1, we alluded to comparisons of their 
distributional properties. We now provide a bit more detail on how these 
comparisons are made. See Knuth [212] and Niederreiter [281] for far more
thorough treatments of the topic. 

If the random variables $U_1, U_2, \ldots$ are independent and uniformly distributed
over the unit interval, then $(U_1, U_2)$ is uniformly distributed over the
unit square, $(U_1, U_2, U_3)$ is uniformly distributed over the unit cube, and so
on. Hence, one way to evaluate a random number generator is to form points
in $[0,1]^d$ from consecutive output values and measure how uniformly these
points fill the space.

The left panel of Figure 2.2 plots consecutive overlapping pairs $(u_1, u_2)$,
$(u_2, u_3)$, ..., $(u_{10}, u_{11})$ produced by a linear congruential generator. The
parameters of the generator are $a = 6$ and $m = 11$, a case considered in
Section 2.1.1. The graph immediately reveals a regular pattern: the ten distinct
points obtained from the full period of the generator lie on just two parallel
lines through the unit square.
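The two lines can be identified explicitly. Since $2 \cdot 6 \equiv 1 \pmod{11}$, every pair satisfies $2 x_{i+1} \equiv x_i \pmod{11}$, so the quantity $2 x_{i+1} - x_i$ can only be a multiple of 11. The short check below (variable names are ours) confirms that it takes just the values 0 and 11, which correspond to the lines $u_{i+1} = u_i/2$ and $u_{i+1} = u_i/2 + 1/2$ in the unit square.

```python
a, m = 6, 11

# One full period of x-values, starting from the seed x_0 = 1.
xs = [1]
for _ in range(10):
    xs.append((a * xs[-1]) % m)

pairs = list(zip(xs[:-1], xs[1:]))          # ten overlapping pairs
line_ids = {2 * y - x for x, y in pairs}    # which line each pair lies on

on_lower = sorted(p for p in pairs if 2 * p[1] - p[0] == 0)
on_upper = sorted(p for p in pairs if 2 * p[1] - p[0] == 11)
```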

This phenomenon is characteristic of all linear congruential generators
(and some other generators as well), though it is of course particularly
pronounced in this simple example. Marsaglia [248] showed that overlapping
$d$-tuples formed from consecutive outputs of a linear congruential generator with
modulus $m$ lie on at most $(d!\,m)^{1/d}$ hyperplanes in the $d$-dimensional unit
cube. For $m = 2^{31} - 1$, this is approximately 108 with $d = 6$ and drops below
39 at $d = 10$. Thus, particularly in high dimensions, the lattice structure of
even the best possible linear congruential generators distinguishes them from
genuinely random numbers.
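Evaluating the bound $(d!\,m)^{1/d}$ directly for $m = 2^{31} - 1$ gives roughly 2344 hyperplanes at $d = 3$, about 108 at $d = 6$, and just under 39 at $d = 10$. A small sketch of the calculation (the function name is ours):

```python
import math

def marsaglia_bound(d, m):
    """Upper bound (d! * m)**(1/d) on the number of hyperplanes."""
    return (math.factorial(d) * m) ** (1.0 / d)

m = 2**31 - 1
bounds = {d: marsaglia_bound(d, m) for d in (3, 6, 10)}
```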

The right panel of Figure 2.2, based on a similar figure in L’Ecuyer [222], 
shows the positions of points produced by the first generator in Table 2.1. 
The figure magnifies the strip $\{(u_1, u_2) : u_1 < .001\}$ and plots the first 10,005
points that fall in this strip starting from a seed of $x_0 = 8835$. (These are all
the points that fall in the strip out of the first ten million points generated 
by the sequence starting from that seed.) At this magnification, the lattice 
structure becomes evident, even in this widely used method. 

The lattice structure of linear congruential generators is often used to 
compare their outputs and select parameters. There are many ways one might 
try to quantify the degree of equidistribution of points on a lattice. The most 
widely used in the analysis of random number generators is the spectral test, 
originally proposed by Coveyou and Macpherson [88]. For each dimension 
d and each set of parallel hyperplanes containing all points in the lattice, 
consider the distance between adjacent hyperplanes. The spectral test takes 
the maximum of these distances over all such sets of parallel hyperplanes. 

To see why taking the maximum is appropriate, consider again the left 
panel of Figure 2.2. The ten points in the graph lie on two positively sloped 
lines. They also lie on five negatively sloped lines and ten vertical lines. De- 
pending on which set of lines we choose, we get a different measure of distance 
between adjacent lines. The maximum distance is achieved by the two posi- 
tively sloped lines passing through the points, and this measure is clearly the 
one that best captures the wide diagonal swath left empty by the generator. 
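For a toy generator the two-dimensional spectral test can be carried out by brute force. The sketch below rests on a standard fact that we assume rather than derive: if integers $(h_1, h_2) \neq (0, 0)$ satisfy $h_1 + a h_2 \equiv 0 \pmod m$, then every pair $(u_i, u_{i+1})$ lies on a line $h_1 u + h_2 v = $ integer, and adjacent lines in that family are $1/\sqrt{h_1^2 + h_2^2}$ apart. The widest spacing therefore comes from the shortest such vector. The function names are ours, and the exhaustive search is feasible only for very small $m$.

```python
import math

def line_spacing(h1, h2):
    """Distance between adjacent lines h1*u + h2*v = integer."""
    return 1.0 / math.hypot(h1, h2)

def spectral_distance_2d(a, m):
    """Largest spacing over all families of parallel lines (brute force)."""
    best = None
    for h1 in range(-m, m + 1):
        for h2 in range(0, m + 1):          # h and -h give the same family
            if (h1, h2) == (0, 0):
                continue
            if (h1 + a * h2) % m == 0:      # family contains every pair
                norm2 = h1 * h1 + h2 * h2
                if best is None or norm2 < best:
                    best = norm2
    return 1.0 / math.sqrt(best)

widest = spectral_distance_2d(6, 11)        # two positively sloped lines
five_lines = line_spacing(5, 1)             # five negatively sloped lines
vertical = line_spacing(11, 0)              # vertical lines
```

For $a = 6$, $m = 11$ the shortest vector is $(-1, 2)$, giving a spacing of $1/\sqrt{5} \approx 0.447$, compared with $1/\sqrt{26} \approx 0.196$ for the five negatively sloped lines and $1/11 \approx 0.091$ for the vertical lines.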
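[Figure 2.2: two scatter plots of consecutive pairs, with $u_i$ on the horizontal
axis and $u_{i+1}$ on the vertical axis. Left panel: $a = 6$, $m = 11$. Right panel:
$a = 16807$, $m = 2^{31} - 1$.]


Fig. 2.2. Lattice structure of linear congruential generators.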


Although the spectral test is an informative measure of uniformity, it does 
not provide a strict ranking of generators because it produces a separate value 
for each dimension d. It is possible for each of two generators to outperform 
the other at some values of d. Fishman and Moore [123] and L’Ecuyer [222] 
base their recommendations of the values in Table 2.1 on spectral tests up to 
dimension d = 6; computing the spectral test becomes increasingly difficult in 
higher dimensions. L’Ecuyer [222] combines results for $d = 2, \ldots, 6$ into a
worst-case figure of merit in order to rank generators.

Niederreiter [281] analyzes the uniformity of point sets in the unit hy- 
percube (including those produced by various random number generators) 
through discrepancy measures, which have some appealing theoretical fea- 
tures not shared by the spectral test. Discrepancy measures are particularly 
important in the analysis of quasi-Monte Carlo methods. 

It is also customary to subject random number generators to various statis- 
tical tests of uniformity and independence. See, e.g., Bratley, Fox, and Schrage 
[59] or Knuth [212] for a discussion of some of the tests often used. 

Given the inevitable shortcomings of any practical random number generator,
it is advisable to use only a small fraction of the period of a generator.
This again points to the advantage of generators with long periods, much
longer than $2^{31}$.


2.1.5 Combined Generators and Other Methods 


We now turn to a discussion of a few other methods for random number
generation. Methods that combine linear congruential generators appear to be
particularly promising because they preserve attractive computational features of
these generators while extending their period and, in some cases, attenuating
their lattice structure. A combined generator proposed by L’Ecuyer [224] and
discussed below appears to meet the requirements for speed, uniformity, and
a long period of most current applications. We also note a few other directions
of work in the area.


Combining Generators 


One way to move beyond the basic linear congruential generator combines 
two or more of these generators through summation. Wichmann and Hill 
[355] propose summing values in the unit interval (i.e., after dividing by the 
modulus); L’Ecuyer [222] sums first and then divides. 

To make this more explicit, consider $J$ generators, the $j$th having parameters
$a_j$, $m_j$:

$$x_{j,i+1} = a_j x_{j,i} \bmod m_j, \qquad u_{j,i+1} = x_{j,i+1}/m_j, \qquad j = 1, \ldots, J.$$


The Wichmann-Hill combination sets $u_{i+1}$ equal to the fractional part of
$u_{1,i+1} + u_{2,i+1} + \cdots + u_{J,i+1}$. L’Ecuyer’s combination takes the form


$$x_{i+1} = \sum_{j=1}^{J} (-1)^{j-1} x_{j,i+1} \bmod (m_1 - 1) \tag{2.9}$$

and

$$u_{i+1} = \begin{cases} x_{i+1}/m_1, & x_{i+1} > 0; \\ (m_1 - 1)/m_1, & x_{i+1} = 0. \end{cases} \tag{2.10}$$


This assumes that $m_1$ is the largest of the $m_j$.

A combination of generators can have a much longer period than any 
of its components. A long period can also be achieved in a single generator 
by using a larger modulus, but a larger modulus complicates the problem 
of avoiding overflow. In combining generators, it is possible to choose each 
multiplier $a_j$ smaller than $\sqrt{m_j}$ in order to use the integer implementation of
Figure 2.1 for each. The sum in (2.9) can then also be implemented in integer
arithmetic, whereas the Wichmann-Hill summation of $u_{j,i}$ is a floating point
operation. L’Ecuyer [222] gives a portable implementation of (2.9)-(2.10). He 
also examines a combination of the first and sixth generators of Table 2.1 and 
finds that the combination has no apparent lattice structure at a magnification 
at which each component generator has a very evident lattice structure. This 
suggests that combined generators can have superior uniformity properties as 
well as long periods and computational convenience. 
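A sketch of a two-component combination in the spirit of (2.9)-(2.10) follows. It is not the portable implementation referred to above. We assume the alternating-sign reading of (2.9), so that with $J = 2$ the second component is subtracted from the first, and we pair the last two rows of Table 2.1 purely for illustration, taking the larger modulus as $m_1$. A Wichmann-Hill style combination of the same two components is included for comparison. The function names and seeds are ours.

```python
M1, A1 = 2147483563, 40014     # component 1 (largest modulus)
M2, A2 = 2147483399, 40692     # component 2

def combined_lcg(seed1, seed2):
    """Yield u_1, u_2, ... from a two-component combination as in (2.9)-(2.10)."""
    x1, x2 = seed1, seed2
    while True:
        x1 = (A1 * x1) % M1
        x2 = (A2 * x2) % M2
        x = (x1 - x2) % (M1 - 1)                    # (2.9) with J = 2
        u = x / M1 if x > 0 else (M1 - 1) / M1      # (2.10)
        yield u

def wichmann_hill_style(seed1, seed2):
    """Yield the fractional part of u_{1,i} + u_{2,i} for the same components."""
    x1, x2 = seed1, seed2
    while True:
        x1 = (A1 * x1) % M1
        x2 = (A2 * x2) % M2
        yield (x1 / M1 + x2 / M2) % 1.0

gen = combined_lcg(12345, 67890)                    # arbitrary seeds (assumption)
sample = [next(gen) for _ in range(1000)]
gen_again = combined_lcg(12345, 67890)
sample_again = [next(gen_again) for _ in range(1000)]
```

Python's `%` operator returns a nonnegative result for a positive modulus, so the difference in (2.9) needs no separate sign correction here; in C the corresponding step would need an explicit adjustment when the difference is negative.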

Another way of extending the basic linear congruential generator uses a 
higher-order recursion of the form 


$$x_i = (a_1 x_{i-1} + a_2 x_{i-2} + \cdots + a_k x_{i-k}) \bmod m, \tag{2.11}$$


followed by $u_i = x_i/m$; this is called a multiple recursive generator, or MRG.
A seed for this generator consists of initial values $x_{k-1}, x_{k-2}, \ldots, x_0$.
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Each of the lagged values x;_; in (2.11) can take up to m distinct values, 
so the vector (a;-1,...,%;-~) can take up to m* distinct values. The sequence 
z; repeats once this vector returns to a previously visited value, and if the 
vector ever reaches (0,...,0) all subsequent x; are identically 0. Thus, the 
longest possible period for (2.11) is më — 1. Knuth [212] gives conditions on 
m and aj,...,a% under which this bound is achieved. 
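The recursion (2.11) and the period bound $m^k - 1$ can be checked on a toy case. The following is a minimal sketch in C, not an implementation from the text: the function names `mrg_next` and `mrg_period_k2` are hypothetical, the coefficients are assumed to be already reduced to the range $0,\ldots,m-1$, and 64-bit intermediate arithmetic is used instead of the overflow-avoiding integer tricks discussed above.

```c
#include <stdint.h>

/* Advance an order-k MRG by one step (illustrative helper).
   state[0] holds x_{i-1}, state[1] holds x_{i-2}, ..., state[k-1] holds x_{i-k};
   a[0..k-1] holds a_1..a_k, each assumed reduced to 0..m-1. */
uint32_t mrg_next(uint32_t *state, const uint32_t *a, int k, uint32_t m)
{
    uint64_t sum = 0;
    for (int j = 0; j < k; j++)
        sum = (sum + (uint64_t)a[j] * state[j]) % m;  /* 64-bit product avoids overflow */
    for (int j = k - 1; j > 0; j--)                   /* shift the window of lagged values */
        state[j] = state[j - 1];
    state[0] = (uint32_t)sum;
    return state[0];
}

/* For k = 2, count the steps until the vector (x_{i-1}, x_{i-2}) first returns
   to its seed (x1, x0). The count is capped at m*m steps, so a return value of
   m*m means the seed vector was not revisited. */
uint64_t mrg_period_k2(uint32_t a1, uint32_t a2, uint32_t m,
                       uint32_t x1, uint32_t x0)
{
    uint32_t a[2] = { a1, a2 };
    uint32_t s[2] = { x1, x0 };
    uint64_t n = 0;
    do {
        mrg_next(s, a, 2, m);
        n++;
    } while ((s[0] != x1 || s[1] != x0) && n < (uint64_t)m * m);
    return n;
}
```

With $m = 3$ and $a_1 = a_2 = 1$ (the Fibonacci recursion mod 3), the seed $(x_1, x_0) = (1, 0)$ is revisited after $3^2 - 1 = 8$ steps, so this small example attains the bound.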

L’Ecuyer [224] combines MRGs using essentially the mechanism in (2.9)- 
(2.10). He shows that the combined generator is, in a precise sense, a close 
approximation to a single MRG with a modulus equal to the product of the 
moduli of the component MRGs. Thus, the combined generator has the advan- 
tages associated with a larger modulus while permitting an implementation 
using smaller values. L’Ecuyer’s investigation further suggests that a combined 
MRG has a less evident lattice structure than the large-modulus MRG it ap- 
proximates, indicating a distributional advantage to the method in addition 
to its computational advantages. 

L’Ecuyer [224] analyzes and recommends a specific combination of two
MRGs: the first has modulus $m = 2^{31} - 1 = 2147483647$ and coefficients
$a_1 = 0$, $a_2 = 63308$, $a_3 = -183326$; the second has $m = 2145483479$ and
$a_1 = 86098$, $a_2 = 0$, $a_3 = -539608$. The combined generator has a period
close to $2^{185}$. Results of the spectral tests in L’Ecuyer [224] in dimensions 4-20
indicate far superior uniformity for the combined generator than for either
of its components. Because none of the coefficients $a_j$ used in this method is
very large, an implementation in integer arithmetic is possible. L’Ecuyer [224]
gives an implementation in the C programming language which we reproduce
in Figure 2.3. We have modified the introduction of the constants for the
generator, using #define statements rather than variable declarations for greater
speed, as recommended by L’Ecuyer [225]. The variables x10,...,x22 must
be initialized to an arbitrary seed before the first call to the routine.

Figure 2.4 reproduces an implementation from L’Ecuyer [225]. L’Ecuyer 
[225] reports that this combined generator has a period of approximately $2^{319}$
and good uniformity properties at least up to dimension 32. The variables
s10,...,s24 must be initialized to an arbitrary seed before the first call to
the routine. The multipliers in this generator are too large to permit a 32-bit 
integer implementation using the method in Figure 2.3, so Figure 2.4 uses 
floating point arithmetic. L’Ecuyer [225] finds that the relative speeds of the 
two methods vary with the computing platform. 


Other Methods 


An alternative strategy for random number generation produces a stream of 
bits that are concatenated to produce integers and then normalized to produce 
points in the unit interval. Bits can be produced by linear recursions mod 2;
e.g.,

$$b_i = (a_1 b_{i-1} + a_2 b_{i-2} + \cdots + a_k b_{i-k}) \bmod 2,$$
#define m1 2147483647
#define m2 2145483479
#define a12 63308
#define a13 -183326
#define a21 86098
#define a23 -539608
#define q12 33921
#define q13 11714
#define q21 24919
#define q23 3976
#define r12 12979
#define r13 2883
#define r21 7417
#define r23 2071
#define Invmp1 4.656612873077393e-10
int x10, x11, x12, x20, x21, x22;   /* state: seed before the first call */

int Random()
{
    int h, p12, p13, p21, p23;
    /* Component 1: each product a*x mod m1 is formed without overflow,
       using q = m1 / |a| and r = m1 mod |a| */
    h = x10/q13; p13 = -a13*(x10-h*q13)-h*r13;
    h = x11/q12; p12 = a12*(x11-h*q12)-h*r12;
    if(p13<0) p13 = p13+m1; if(p12<0) p12 = p12+m1;
    x10 = x11; x11 = x12; x12 = p12-p13; if(x12<0) x12 = x12+m1;
    /* Component 2 */
    h = x20/q23; p23 = -a23*(x20-h*q23)-h*r23;
    h = x22/q21; p21 = a21*(x22-h*q21)-h*r21;
    if(p23<0) p23 = p23+m2; if(p21<0) p21 = p21+m2;
    x20 = x21; x21 = x22; x22 = p21-p23; if(x22<0) x22 = x22+m2;
    /* Combination */
    if (x12<x22) return (x12-x22+m1); else return (x12-x22);
}

double Uniform01()
{
    int Z;
    /* map 0 to m1 so that the returned value lies in (0,1] */
    Z=Random(); if(Z==0) Z=m1; return (Z*Invmp1);
}

Fig. 2.3. Implementation in C of a combined multiple recursive generator using
integer arithmetic. The generator and the implementation are from L’Ecuyer [224].


with all $a_j$ equal to 0 or 1. This method was proposed by Tausworthe [346]. It
can be implemented through a mechanism known as a feedback shift register.
The implementation and theoretical properties of these generators (and also
of generalized feedback shift register methods) have been studied extensively.
Matsumoto and Nishimura [258] develop a generator of this type with a period
of $2^{19937} - 1$ and apparently excellent uniformity properties. They provide C
code for its implementation.

Inversive congruential generators use recursions of the form

$$x_{i+1} = (a x_i^- + c) \bmod m,$$

where the (mod $m$)-inverse $x^-$ of $x$ is an integer in $\{1,\ldots,m-1\}$ (unique
if it exists) satisfying $x x^- = 1 \bmod m$. This is an example of a nonlinear
congruential generator. Inversive generators are free of the lattice structure 
double s10, s11, s12, s13, s14, s20, s21, s22, s23, s24;

#define norm 2.3283163396834613e-10
#define m1 4294949027.0
#define m2 4294934327.0
#define a12 1154721.0
#define a14 1739991.0
#define a15n 1108499.0
#define a21 1776413.0
#define a23 865203.0
#define a25n 1641052.0

double MRG32k5a ()
{
    long k;
    double p1, p2;
    /* Component 1 */
    p1 = a12 * s13 - a15n * s10;
    if (p1 > 0.0) p1 -= a14 * m1;
    p1 += a14 * s11; k = p1 / m1; p1 -= k * m1;
    if (p1 < 0.0) p1 += m1;
    s10 = s11; s11 = s12; s12 = s13; s13 = s14; s14 = p1;
    /* Component 2 */
    p2 = a21 * s24 - a25n * s20;
    if (p2 > 0.0) p2 -= a23 * m2;
    p2 += a23 * s22; k = p2 / m2; p2 -= k * m2;
    if (p2 < 0.0) p2 += m2;
    s20 = s21; s21 = s22; s22 = s23; s23 = s24; s24 = p2;
    /* Combination */
    if (p1 <= p2) return ((p1 - p2 + m1) * norm);
    else return ((p1 - p2) * norm);
}

Fig. 2.4. Implementation in C of a combined multiple recursive generator using
floating point arithmetic. The generator and implementation are from L’Ecuyer
[225].


characteristic of linear congruential generators but they are much more com- 
putationally demanding. They may be useful for comparing results in cases 
where the deficiencies of a random number generator are cause for concern. 
See Eichenauer-Herrmann, Herrmann, and Wegenkittl [110] for a survey of 
this approach and additional references. 


2.2 General Sampling Methods 


With an introduction to random number generation behind us, we hence- 
forth assume the availability of an ideal sequence of random numbers. More 
precisely, we assume the availability of a sequence $U_1, U_2, \ldots$ of independent
random variables, each satisfying

$$P(U_i \le u) = \begin{cases} 0, & u < 0 \\ u, & 0 \le u \le 1 \\ 1, & u > 1 \end{cases} \tag{2.12}$$

i.e., each uniformly distributed between 0 and 1. A simulation algorithm
transforms these independent uniforms into sample paths of stochastic processes.

Most simulations entail sampling random variables or random vectors from 
distributions other than the uniform. A typical simulation uses methods for 
transforming samples from the uniform distribution to samples from other 
distributions. There is a large literature on both general purpose methods 
and specialized algorithms for specific cases. In this section, we present two of 
the most widely used general techniques: the inverse transform method and 
the acceptance-rejection method. 


2.2.1 Inverse Transform Method 


Suppose we want to sample from a cumulative distribution function $F$; i.e.,
we want to generate a random variable $X$ with the property that $P(X \le x) =
F(x)$ for all $x$. The inverse transform method sets

$$X = F^{-1}(U), \quad U \sim \text{Unif}[0,1], \tag{2.13}$$

where $F^{-1}$ is the inverse of $F$ and Unif[0,1] denotes the uniform distribution
on [0,1].




Fig. 2.5. Inverse transform method. 


This transformation is illustrated in Figure 2.5 for a hypothetical cumulative
distribution $F$. In the figure, values of $u$ between 0 and $F(0)$ are mapped
to negative values of $x$ whereas values between $F(0)$ and 1 are mapped to
positive values. The left panel of Figure 2.6 depicts a cumulative distribution
function with a jump at $x_0$; i.e.,

$$\lim_{x \uparrow x_0} F(x) \equiv F(x_0-) < F(x_0+) \equiv \lim_{x \downarrow x_0} F(x).$$

Under the distribution $F$, the outcome $x_0$ has probability $F(x_0+) - F(x_0-)$. As
indicated in the figure, all values of $u$ between $u_1 = F(x_0-)$ and $u_2 = F(x_0+)$
are mapped to $x_0$.
The inverse of $F$ is well-defined if $F$ is strictly increasing; otherwise, we
need a rule to break ties. For example, we may set

$$F^{-1}(u) = \inf\{x : F(x) \ge u\}; \tag{2.14}$$

if there are many values of $x$ for which $F(x) = u$, this rule chooses the smallest.

We need a rule like (2.14) in cases where the cumulative distribution $F$
has flat sections, because the inverse of $F$ is not well-defined at such points;
see, e.g., the right panel of Figure 2.6. Observe, however, that if $F$ is constant
over an interval $[a,b]$ and if $X$ has distribution $F$, then

$$P(a < X \le b) = F(b) - F(a) = 0,$$

so flat sections of $F$ correspond to intervals of zero probability for the random
variable. If $F$ has a continuous density, then $F$ is strictly increasing (and its
inverse is well-defined) anywhere the density is nonzero.




Fig. 2.6. Inverse transform for distributions with jumps (left) or flat sections (right). 


To verify that the inverse transform (2.13) generates samples from $F$, we
check the distribution of the $X$ it produces:

$$P(X \le x) = P(F^{-1}(U) \le x) = P(U \le F(x)) = F(x).$$

The second equality follows from the fact that, with $F^{-1}$ as we have defined
it, the events $\{F^{-1}(u) \le x\}$ and $\{u \le F(x)\}$ coincide for all $u$ and $x$. The last
equality follows from (2.12).

One may interpret the input U to the inverse transform method as a 
random percentile. If F is continuous and X ~ F, then X is just as likely to 
fall between, say, the 20th and 30th percentiles of F as it is to fall between the 
85th and 95th. In other words, the percentile at which $X$ falls (namely $F(X)$)
is uniformly distributed. The inverse transform method chooses a percentile
level uniformly and then maps it to a corresponding value of the random
variable.

We illustrate the method with examples. These examples also show that
a direct implementation of the inverse transform method can sometimes be
made more efficient through minor modifications.

Example 2.2.1 Exponential distribution. The exponential distribution with
mean $\theta$ has distribution

$$F(x) = 1 - e^{-x/\theta}, \quad x \ge 0.$$

This is, for example, the distribution of the times between jumps of a Poisson
process with rate $1/\theta$. Inverting the exponential distribution yields the
algorithm $X = -\theta \log(1 - U)$. This can also be implemented as

$$X = -\theta \log(U) \tag{2.15}$$

because $U$ and $1 - U$ have the same distribution. □


Example 2.2.2 Arcsine law. The time at which a standard Brownian motion
attains its maximum over the time interval [0,1] has distribution

$$F(x) = \frac{2}{\pi} \arcsin(\sqrt{x}), \quad 0 \le x \le 1.$$

The inverse transform method for sampling from this distribution is $X =
\sin^2(U\pi/2)$, $U \sim$ Unif[0,1]. Using the identity $2\sin^2(t) = 1 - \cos(2t)$ for $0 \le
t \le \pi/2$, we can simplify the transformation to

$$X = \tfrac{1}{2} - \tfrac{1}{2}\cos(U\pi), \quad U \sim \text{Unif}[0,1].$$

□


Example 2.2.3 Rayleigh distribution. If we condition a standard Brownian
motion starting at the origin to be at $b$ at time 1, then its maximum over [0,1]
has the Rayleigh distribution

$$F(x) = 1 - e^{-2x(x-b)}, \quad x \ge b.$$

Solving the equation $F(x) = u$, $u \in (0,1)$, results in a quadratic with roots

$$x = \frac{b}{2} \pm \frac{\sqrt{b^2 - 2\log(1-u)}}{2}.$$

The inverse of $F$ is given by the larger of the two roots; in particular, we
must have $x \ge b$ since the maximum of the Brownian path must be at least
as large as the terminal value. Thus, replacing $1 - U$ with $U$ as we did in
Example 2.2.1, we arrive at

$$X = \frac{b}{2} + \frac{\sqrt{b^2 - 2\log(U)}}{2}.$$

□
Even if the inverse of $F$ is not known explicitly, the inverse transform
method is still applicable through numerical evaluation of $F^{-1}$. Computing
$F^{-1}(u)$ is equivalent to finding a root $x$ of the equation $F(x) - u = 0$. For a
distribution $F$ with density $f$, Newton’s method for finding roots produces a
sequence of iterates

$$x_{n+1} = x_n - \frac{F(x_n) - u}{f(x_n)},$$

given a starting point $x_0$. In the next example, root finding takes a special
form.


Example 2.2.4 Discrete distributions. In the case of a discrete distribution,
evaluation of $F^{-1}$ reduces to a table lookup. Consider, for example, a discrete
random variable whose possible values are $c_1 < \cdots < c_n$. Let $p_i$ be the
probability attached to $c_i$, $i = 1,\ldots,n$, and set $q_0 = 0$,

$$q_i = \sum_{j=1}^{i} p_j, \quad i = 1,\ldots,n.$$

These are the cumulative probabilities associated with the $c_i$; that is, $q_i =
F(c_i)$, $i = 1,\ldots,n$. To sample from this distribution,

(i) generate a uniform $U$;

(ii) find $K \in \{1,\ldots,n\}$ such that $q_{K-1} < U \le q_K$;

(iii) set $X = c_K$.

The second step can be implemented through binary search. Bratley, Fox, and
Schrage [59], and Fishman [121] discuss potentially faster methods. □
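Step (ii) of this example can be sketched as a binary search over the table of cumulative probabilities. This is a minimal illustration with a hypothetical function name; it assumes the array `q` has $n + 1$ entries with `q[0] = 0` and `q[n] = 1`, and that `u` lies in (0,1].

```c
/* Returns K in {1,...,n} with q[K-1] < u <= q[K].
   q[0..n] holds the cumulative probabilities, q[0] = 0 and q[n] = 1. */
int discrete_inverse(const double *q, int n, double u)
{
    int lo = 0, hi = n;              /* invariant: q[lo] < u <= q[hi] */
    while (hi - lo > 1) {
        int mid = lo + (hi - lo) / 2;
        if (u <= q[mid])
            hi = mid;
        else
            lo = mid;
    }
    return hi;
}
```

The sample is then `c[K]` for a table `c` of outcomes indexed from 1. The search uses on the order of $\log_2 n$ comparisons per sample.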


Our final example illustrates a general feature of the inverse transform 
method rather than a specific case. 


Example 2.2.5 Conditional distributions. Suppose $X$ has distribution $F$
and consider the problem of sampling $X$ conditional on $a < X \le b$, with
$F(a) < F(b)$. Using the inverse transform method, this is no more difficult
than generating $X$ unconditionally. If $U \sim$ Unif[0,1], then the random variable
$V$ defined by

$$V = F(a) + [F(b) - F(a)]U$$

is uniformly distributed between $F(a)$ and $F(b)$, and $F^{-1}(V)$ has the desired
conditional distribution. To see this, observe that

$$\begin{aligned} P(F^{-1}(V) \le x) &= P(F(a) + [F(b) - F(a)]U \le F(x)) \\ &= P(U \le [F(x) - F(a)]/[F(b) - F(a)]) \\ &= [F(x) - F(a)]/[F(b) - F(a)], \end{aligned}$$

and this is precisely the distribution of $X$ given $a < X \le b$. Either of the
endpoints $a, b$ could be infinite in this example. □
The inverse transform method is seldom the fastest method for sampling
from a distribution, but it has important features that make it attractive 
nevertheless. One is its use in sampling from conditional distributions just il- 
lustrated; we point out two others. First, the inverse transform method maps 
the input U monotonically and — if F is strictly increasing — continuously 
to the output X. This can be useful in the implementation of variance re- 
duction techniques and in sensitivity estimation, as we will see in Chapters 4 
and 7. Second, the inverse transform method requires just one uniform ran- 
dom variable for each sample generated. This is particularly important in 
using quasi-Monte Carlo methods where the dimension of a problem is often 
equal to the number of uniforms needed to generate one “path.” Methods that 
require multiple uniforms per variable generated result in higher-dimensional 
representations for which quasi-Monte Carlo may be much less effective. 


2.2.2 Acceptance-Rejection Method 


The acceptance-rejection method, introduced by Von Neumann [353], is 
among the most widely applicable mechanisms for generating random samples. 
This method generates samples from a target distribution by first generating 
candidates from a more convenient distribution and then rejecting a random 
subset of the generated candidates. The rejection mechanism is designed so 
that the accepted samples are indeed distributed according to the target dis- 
tribution. The technique is by no means restricted to univariate distributions. 

Suppose, then, that we wish to generate samples from a density $f$ defined
on some set $\mathcal{X}$. This could be a subset of the real line, of $\Re^d$, or a more general
set. Let $g$ be a density on $\mathcal{X}$ from which we know how to generate samples
and with the property that

$$f(x) \le c g(x), \quad \text{for all } x \in \mathcal{X}$$

for some constant $c$. In the acceptance-rejection method, we generate a sample
$X$ from $g$ and accept the sample with probability $f(X)/cg(X)$; this can be
implemented by sampling $U$ uniformly over (0,1) and accepting $X$ if $U \le
f(X)/cg(X)$. If $X$ is rejected, a new candidate is sampled from $g$ and the
acceptance test applied again. The process repeats until the acceptance test is 
passed; the accepted value is returned as a sample from f. Figure 2.7 illustrates 
a generic implementation. 

To verify the validity of the acceptance-rejection method, let $Y$ be a sample
returned by the algorithm and observe that $Y$ has the distribution of $X$
conditional on $U \le f(X)/cg(X)$. Thus, for any $A \subseteq \mathcal{X}$,

$$P(Y \in A) = P(X \in A \mid U \le f(X)/cg(X)) = \frac{P(X \in A,\, U \le f(X)/cg(X))}{P(U \le f(X)/cg(X))}. \tag{2.16}$$

Given $X$, the probability that $U \le f(X)/cg(X)$ is simply $f(X)/cg(X)$ because
$U$ is uniform; hence, the denominator in (2.16) is given by
1. generate X from distribution g
2. generate U from Unif[0,1]
3. if U ≤ f(X)/cg(X)
       return X
   otherwise
       go to Step 1.

Fig. 2.7. The acceptance-rejection method for sampling from density f using
candidates from density g.


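$$P(U \le f(X)/cg(X)) = \int_{\mathcal{X}} \frac{f(x)}{cg(x)}\, g(x)\,dx = 1/c \tag{2.17}$$

(taking $0/0 = 1$ if $g(x) = 0$ somewhere on $\mathcal{X}$). Making this substitution in
(2.16), we find that

$$P(Y \in A) = c\,P(X \in A,\, U \le f(X)/cg(X)) = c \int_A \frac{f(x)}{cg(x)}\, g(x)\,dx = \int_A f(x)\,dx.$$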


Since A is arbitrary, this verifies that Y has density f. 

In fact, this argument shows more: Equation (2.17) shows that the proba- 
bility of acceptance on each attempt is 1/c. Because the attempts are mutually 
independent, the number of candidates generated until one is accepted is geo- 
metrically distributed with mean c. It is therefore preferable to have c close to 
1 (it can never be less than 1 if f and g both integrate to 1). Tighter bounds 
on the target density f result in fewer wasted samples from g. Of course, a 
prerequisite for the method is the ability to sample from g; the speed of the 
method depends on both c and the effort involved in sampling from g. 

We illustrate the method with examples. 


Example 2.2.6 Beta distribution. The beta density on [0,1] with parameters
$\alpha_1, \alpha_2 > 0$ is given by

$$f(x) = \frac{1}{B(\alpha_1,\alpha_2)}\, x^{\alpha_1 - 1}(1 - x)^{\alpha_2 - 1}, \quad 0 \le x \le 1,$$

with

$$B(\alpha_1,\alpha_2) = \int_0^1 x^{\alpha_1 - 1}(1 - x)^{\alpha_2 - 1}\,dx = \frac{\Gamma(\alpha_1)\Gamma(\alpha_2)}{\Gamma(\alpha_1 + \alpha_2)}$$

and $\Gamma$ the gamma function. Varying the parameters $\alpha_1, \alpha_2$ results in a variety
of shapes, making this a versatile family of distributions with bounded support.
Among many other applications, beta distributions are used to model
the random recovery rate (somewhere between 0 and 100%) upon default of a
bond subject to credit risk. The case $\alpha_1 = \alpha_2 = 1/2$ is the arcsine distribution
considered in Example 2.2.2.
If $\alpha_1, \alpha_2 \ge 1$ and at least one of the parameters exceeds 1, the beta density
is unimodal and achieves its maximum at $(\alpha_1 - 1)/(\alpha_1 + \alpha_2 - 2)$. Let $c$ be the
value of the density $f$ at this point. Then $f(x) \le c$ for all $x$, so we may choose
$g$ to be the uniform density ($g(x) = 1$, $0 \le x \le 1$), which is in fact the beta
density with parameters $\alpha_1 = \alpha_2 = 1$. In this case, the acceptance-rejection
method becomes

    Generate U1, U2 from Unif[0,1] until cU2 ≤ f(U1)
    Return U1

This is illustrated in Figure 2.8 for parameters $\alpha_1 = 3$, $\alpha_2 = 2$.

As is clear from Figure 2.8, generating candidates from the uniform distribution
results in many rejected samples and thus many evaluations of $f$.
(The expected number of candidates generated for each accepted sample is
$c \approx 1.778$ for the density in the figure.) Faster methods for sampling from
beta distributions, which combine more carefully designed acceptance-rejection
schemes with the inverse transform and other methods, are detailed in Devroye
[95], Fishman [121], Gentle [136], and Johnson, Kotz, and Balakrishnan
[202]. □
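The uniform-candidate scheme of this example can be sketched for the case $\alpha_1 = 3$, $\alpha_2 = 2$, where $B(3,2) = 1/12$, the density is $f(x) = 12x^2(1-x)$, and its maximum is $c = f(2/3) = 16/9 \approx 1.778$. The code is an illustration only: the function names are hypothetical, and the helper `unif01`, built on the C library's `rand()`, is a stand-in for a better generator such as those in Figures 2.3 and 2.4.

```c
#include <stdlib.h>

/* Uniform on (0,1) built from rand(); a stand-in for a better generator. */
static double unif01(void)
{
    return (rand() + 0.5) / ((double)RAND_MAX + 1.0);
}

/* Beta(3,2) density: f(x) = 12 x^2 (1 - x) on [0,1]. */
static double beta32_density(double x)
{
    return 12.0 * x * x * (1.0 - x);
}

/* Acceptance-rejection with uniform candidates.
   If tries is non-null, *tries is incremented once per candidate generated. */
double beta32_sample(long *tries)
{
    const double c = 16.0 / 9.0;     /* f(2/3), the maximum of the density */
    for (;;) {
        double u1 = unif01();
        double u2 = unif01();
        if (tries)
            ++*tries;
        if (c * u2 <= beta32_density(u1))
            return u1;               /* accept the candidate U1 */
    }
}
```

Averaging the candidate count over many samples should give a value near $c$, in line with the geometric distribution of the number of attempts.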




Fig. 2.8. Illustration of the acceptance-rejection method using uniformly distributed 
candidates. 


Example 2.2.7 Normal from double exponential. Fishman [121, p.173] illus- 
trates the use of the acceptance-rejection method by generating half-normal 
samples from the exponential distribution. (A half-normal random variable 
has the distribution of the absolute value of a normal random variable.) Fish- 
man also notes that the method can be used to generate normal random 
variables and we present the example in this form. Because of its importance 
in financial applications, we devote all of Section 2.3 to the normal distribution;
we include this example here primarily to further illustrate
acceptance-rejection.

The double exponential density on $(-\infty, \infty)$ is $g(x) = \exp(-|x|)/2$ and
the normal density is $f(x) = \exp(-x^2/2)/\sqrt{2\pi}$. The ratio is

$$\frac{f(x)}{g(x)} = \sqrt{\frac{2}{\pi}}\, e^{-\frac{1}{2}x^2 + |x|} \le \sqrt{\frac{2e}{\pi}} \approx 1.3155 \equiv c.$$

Thus, the normal density is dominated by the scaled double exponential density
$cg(x)$, as illustrated in Figure 2.9. A sample from the double exponential
density can be generated using (2.15) to draw a standard exponential random
variable and then randomizing the sign. The rejection test $u > f(x)/cg(x)$
can be implemented as

$$u > \exp(-\tfrac{1}{2}x^2 + |x| - \tfrac{1}{2}) = \exp(-\tfrac{1}{2}(|x| - 1)^2).$$

In light of the symmetry of both $f$ and $g$, it suffices to generate positive
samples and determine the sign only if the sample is accepted; in this case,
the absolute value is unnecessary in the rejection test. The combined steps
are as follows:

1. generate U1, U2, U3 from Unif[0,1]
2. X ← −log(U1)
3. if U2 > exp(−0.5(X − 1)²)
       go to Step 1
4. if U3 ≤ 0.5
       X ← −X
5. return X

□


Example 2.2.8 Conditional distributions. Consider the problem of generating
a random variable or vector $X$ conditional on $X \in A$, for some set $A$. In
the scalar case, this can be accomplished using the inverse transform method
if $A$ is an interval; see Example 2.2.5. In more general settings it may be
difficult to sample directly from the conditional distribution. However, so long
as it is possible to generate unconditional samples, one may always resort to
the following crude procedure:

    Generate X until X ∈ A
    return X

This may be viewed as a degenerate form of acceptance-rejection. Let $f$
denote the conditional density and let $g$ denote the unconditional density;
then

$$\frac{f(x)}{g(x)} = \begin{cases} 1/P(X \in A), & x \in A; \\ 0, & x \notin A. \end{cases}$$

Thus, $c = 1/P(X \in A)$ is an upper bound on the ratio. Moreover, since the
ratio $f(x)/cg(x)$ is either 0 or 1 at every $x$, it is unnecessary to randomize the
rejection decision: a candidate $X$ is accepted precisely if $X \in A$. □

Fig. 2.9. Normal density and scaled double exponential.


Acceptance-rejection can often be accelerated through the squeeze method, 
in which simpler tests are applied before the exact acceptance threshold 
f(x)/cg(x) is evaluated. The simpler tests are based on functions that bound 
f(x)/cg(x) from above and below. The effectiveness of this method depends 
on the quality of the bounding functions and the speed with which they can 
be evaluated. See Fishman [121] for a detailed discussion. 

Although we have restricted attention to sampling from densities, it should 
be clear that the acceptance-rejection method also applies when f and g are 
replaced with the mass functions of discrete distributions. 

The best methods for sampling from a specific distribution invariably rely 
on special features of the distribution. Acceptance-rejection is frequently com- 
bined with other techniques to exploit special features — it is perhaps more 
a principle than a method. 

At the end of Section 2.2.1 we noted that one attractive feature of the 
inverse transform method is that it uses exactly one uniform random vari- 
able per nonuniform random variable generated. When simulation problems 
are formulated as numerical integration problems, the dimension of the in- 
tegrand is typically the maximum number of uniform variables needed to 
generate a simulation “path.” The effectiveness of quasi-Monte Carlo and re- 
lated integration methods generally deteriorates as the dimension increases, 
so in using those methods, we prefer representations that keep the dimen- 
sion as small as possible. With an acceptance-rejection method, there is or- 
dinarily no upper bound on the number of uniforms required to generate 
even a single nonuniform variable; simulations that use acceptance-rejection 
therefore correspond to infinite-dimensional integration problems. For this
reason, acceptance-rejection methods are generally inapplicable with quasi-
Monte Carlo methods. A further potential drawback of acceptance-rejection
methods, compared with the inverse transform method, is that their outputs
are generally neither continuous nor monotone functions of the input uniforms.
This can diminish the effectiveness of the antithetic variates method,
for example.


2.3 Normal Random Variables and Vectors 


Normal random variables are the building blocks of many financial simula- 
tion models, so we discuss methods for sampling from normal distributions in 
detail. We begin with a brief review of basic properties of normal distributions. 


2.3.1 Basic Properties 


The standard univariate normal distribution has density

$$\phi(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}, \quad -\infty < x < \infty \tag{2.18}$$

and cumulative distribution function

$$\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-u^2/2}\,du. \tag{2.19}$$

Standard indicates mean 0 and variance 1. More generally, the normal distribution
with mean $\mu$ and variance $\sigma^2$, $\sigma > 0$, has density

$$\phi_{\mu,\sigma}(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

and cumulative distribution

$$\Phi_{\mu,\sigma}(x) = \Phi\left(\frac{x - \mu}{\sigma}\right).$$

The notation $X \sim N(\mu, \sigma^2)$ abbreviates the statement that the random variable
$X$ is normally distributed with mean $\mu$ and variance $\sigma^2$.

If $Z \sim N(0,1)$ (i.e., $Z$ has the standard normal distribution), then

$$\mu + \sigma Z \sim N(\mu, \sigma^2).$$

Thus, given a method for generating samples $Z_1, Z_2, \ldots$ from the standard
normal distribution, we can generate samples $X_1, X_2, \ldots$ from $N(\mu, \sigma^2)$ by
setting $X_i = \mu + \sigma Z_i$. It therefore suffices to consider methods for sampling
from $N(0,1)$.
A $d$-dimensional normal distribution is characterized by a $d$-vector $\mu$ and
a $d \times d$ covariance matrix $\Sigma$; we abbreviate it as $N(\mu, \Sigma)$. To qualify as a
covariance matrix, $\Sigma$ must be symmetric (i.e., $\Sigma$ and its transpose $\Sigma^\top$ are
equal) and positive semidefinite, meaning that

$$x^\top \Sigma x \ge 0 \tag{2.20}$$

for all $x \in \Re^d$. This is equivalent to the requirement that all eigenvalues of $\Sigma$ be
nonnegative. (As a symmetric matrix, $\Sigma$ automatically has real eigenvalues.)
If $\Sigma$ is positive definite (meaning that strict inequality holds in (2.20) for all
nonzero $x \in \Re^d$ or, equivalently, that all eigenvalues of $\Sigma$ are positive), then
the normal distribution $N(\mu, \Sigma)$ has density

$$\phi_{\mu,\Sigma}(x) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left(-\tfrac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu)\right), \quad x \in \Re^d, \tag{2.21}$$

with $|\Sigma|$ the determinant of $\Sigma$. The standard $d$-dimensional normal $N(0, I_d)$,
with $I_d$ the $d \times d$ identity matrix, is the special case

$$\frac{1}{(2\pi)^{d/2}} \exp\left(-\tfrac{1}{2} x^\top x\right).$$

If $X \sim N(\mu, \Sigma)$ (i.e., the random vector $X$ has a multivariate normal
distribution), then its $i$th component $X_i$ has distribution $N(\mu_i, \sigma_i^2)$, with $\sigma_i^2 =
\Sigma_{ii}$. The $i$th and $j$th components have covariance

$$\text{Cov}[X_i, X_j] = E[(X_i - \mu_i)(X_j - \mu_j)] = \Sigma_{ij},$$

which justifies calling $\Sigma$ the covariance matrix. The correlation between $X_i$
and $X_j$ is given by

$$\rho_{ij} = \frac{\Sigma_{ij}}{\sigma_i \sigma_j}.$$

In specifying a multivariate distribution, it is sometimes convenient to use this
definition in the opposite direction: specify the marginal standard deviations
$\sigma_i$, $i = 1,\ldots,d$, and the correlations $\rho_{ij}$ from which the covariance matrix

$$\Sigma_{ij} = \sigma_i \sigma_j \rho_{ij} \tag{2.22}$$

is then determined.

If the $d \times d$ symmetric matrix $\Sigma$ is positive semidefinite but not positive
definite then the rank of $\Sigma$ is less than $d$, $\Sigma$ fails to be invertible, and there
is no normal density with covariance matrix $\Sigma$. In this case, we can define
the normal distribution $N(\mu, \Sigma)$ as the distribution of $X = \mu + AZ$ with
$Z \sim N(0, I_d)$ for any $d \times d$ matrix $A$ satisfying $AA^\top = \Sigma$. The resulting
distribution is independent of which such $A$ is chosen. The random vector
$X$ does not have a density in $\Re^d$, but if $\Sigma$ has rank $k$ then one can find $k$
components of $X$ with a multivariate normal density in $\Re^k$.
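Three further properties of the multivariate normal distribution merit special
mention:

Linear Transformation Property: Any linear transformation of a normal
vector is again normal:

$$X \sim N(\mu, \Sigma) \Rightarrow AX \sim N(A\mu, A\Sigma A^\top), \tag{2.23}$$

for any $d$-vector $\mu$, and $d \times d$ matrix $\Sigma$, and any $k \times d$ matrix $A$, for any $k$.

Conditioning Formula: Suppose the partitioned vector $(X_{[1]}, X_{[2]})$ (where
each $X_{[i]}$ may itself be a vector) is multivariate normal with

$$\begin{pmatrix} X_{[1]} \\ X_{[2]} \end{pmatrix} \sim N\left( \begin{pmatrix} \mu_{[1]} \\ \mu_{[2]} \end{pmatrix}, \begin{pmatrix} \Sigma_{[11]} & \Sigma_{[12]} \\ \Sigma_{[21]} & \Sigma_{[22]} \end{pmatrix} \right), \tag{2.24}$$

and suppose $\Sigma_{[22]}$ has full rank. Then

$$(X_{[1]} \mid X_{[2]} = x) \sim N\left(\mu_{[1]} + \Sigma_{[12]} \Sigma_{[22]}^{-1} (x - \mu_{[2]}),\; \Sigma_{[11]} - \Sigma_{[12]} \Sigma_{[22]}^{-1} \Sigma_{[21]}\right). \tag{2.25}$$

In (2.24), the dimensions of the $\mu_{[i]}$ and $\Sigma_{[ij]}$ are consistent with those of
the $X_{[i]}$. Equation (2.25) then gives the distribution of $X_{[1]}$ conditional on
$X_{[2]} = x$.

Moment Generating Function: If $X \sim N(\mu, \Sigma)$ with $X$ $d$-dimensional,
then

$$E[\exp(\theta^\top X)] = \exp\left(\mu^\top \theta + \tfrac{1}{2} \theta^\top \Sigma \theta\right) \tag{2.26}$$

for all $\theta \in \Re^d$.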
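As a worked special case of the conditioning formula (2.25), offered here as an illustration, take $d = 2$ with scalar components and write the covariance matrix in the form (2.22), with $\rho$ denoting the correlation between $X_1$ and $X_2$ and assuming $\sigma_2 > 0$. Substituting $\Sigma_{[12]} = \Sigma_{[21]} = \rho\sigma_1\sigma_2$ and $\Sigma_{[22]} = \sigma_2^2$ into (2.25) gives:

```latex
\Sigma = \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}
\quad\Longrightarrow\quad
(X_1 \mid X_2 = x) \sim N\!\left( \mu_1 + \rho\,\frac{\sigma_1}{\sigma_2}\,(x - \mu_2),\;
\sigma_1^2 (1 - \rho^2) \right).
```

The conditional variance $\sigma_1^2(1 - \rho^2)$ does not depend on the observed value $x$, and it shrinks to zero as $|\rho|$ approaches 1.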


2.3.2 Generating Univariate Normals 


We now discuss algorithms for generating samples from univariate normal 
distributions. As noted in the previous section, it suffices to consider sam- 
pling from N (0,1). We assume the availability of a sequence U1, U2,... of 
independent random variables uniformly distributed on the unit interval [0, 1] 
and consider methods for transforming these uniform random variables to 
normally distributed random variables. 


Box-Muller Method 


Perhaps the simplest method to implement (though not the fastest or neces- 
sarily the most convenient) is the Box-Muller [51] algorithm. This algorithm 
generates a sample from the bivariate standard normal, each component of 
which is thus a univariate standard normal. The algorithm is based on the 
following two properties of the bivariate normal: if $Z \sim N(0, I_2)$, then

(i) $R = Z_1^2 + Z_2^2$ is exponentially distributed with mean 2, i.e.,
$P(R \le x) = 1 - e^{-x/2}$;

(ii) given $R$, the point $(Z_1, Z_2)$ is uniformly distributed on the circle of radius
$\sqrt{R}$ centered at the origin.

Thus, to generate $(Z_1, Z_2)$, we may first generate $R$ and then choose a point
uniformly from the circle of radius $\sqrt{R}$. To sample from the exponential
distribution we may set $R = -2\log(U_1)$, with $U_1 \sim$ Unif[0,1], as in (2.15). To
generate a random point on a circle, we may generate a random angle uniformly
between 0 and $2\pi$ and then map the angle to a point on the circle.
The random angle may be generated as $V = 2\pi U_2$, $U_2 \sim$ Unif[0,1]; the
corresponding point on the circle has coordinates $(\sqrt{R}\cos(V), \sqrt{R}\sin(V))$. The
complete algorithm is given in Figure 2.10.

generate U1, U2 independent Unif[0,1]
R ← −2 log(U1)
V ← 2πU2
Z1 ← √R cos(V), Z2 ← √R sin(V)
return Z1, Z2.

Fig. 2.10. Box-Muller algorithm for generating normal random variables.


Marsaglia and Bray [250] developed a modification of the Box-Muller
method that reduces computing time by avoiding evaluation of the sine and
cosine functions. The Marsaglia-Bray method instead uses acceptance-rejection
to sample points uniformly in the unit disc and then transforms these points
to normal variables.

The algorithm is illustrated in Figure 2.11. The transformation $U_i \leftarrow 2U_i -
1$, $i = 1, 2$, makes $(U_1, U_2)$ uniformly distributed over the square $[-1,1] \times
[-1,1]$. Accepting only those pairs for which $X = U_1^2 + U_2^2$ is less than or
equal to 1 produces points uniformly distributed over the disc of radius 1
centered at the origin. Conditional on acceptance, $X$ is uniformly distributed
between 0 and 1, so the $\log X$ in Figure 2.11 has the same effect as the $\log U_1$
in Figure 2.10. Dividing each accepted $(U_1, U_2)$ by $\sqrt{X}$ projects it from the
unit disc to the unit circle, on which it is uniformly distributed. Moreover,
$(U_1/\sqrt{X}, U_2/\sqrt{X})$ is independent of $X$ conditional on $X \le 1$. Hence, the
justification for the last step in Figure 2.11 is the same as that for the Box-
Muller method.

As is the case with most acceptance-rejection methods, there is no upper 
bound on the number of uniforms the Marsaglia-Bray algorithm may use 
to generate a single normal variable (or pair of variables). This renders the 
method inapplicable with quasi-Monte Carlo simulation. 

while (X > 1)
    generate U_1, U_2 ~ Unif[0,1]
    U_1 ← 2 * U_1 − 1, U_2 ← 2 * U_2 − 1
    X ← U_1^2 + U_2^2
end
Y ← √(−2 log X / X)
Z_1 ← U_1 Y, Z_2 ← U_2 Y
return Z_1, Z_2.


Fig. 2.11. Marsaglia-Bray algorithm for generating normal random variables. 
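
A short Python sketch of the acceptance-rejection loop in Figure 2.11 is given below. The function name `marsaglia_bray`, the use of the standard-library generator `random.Random`, and the extra exclusion of the point X = 0 (to keep the logarithm finite) are assumptions of this illustration, not part of the figure.

```python
import math
import random

def marsaglia_bray(rng):
    """Return a pair of independent N(0,1) values using the polar method."""
    while True:
        u1 = 2.0 * rng.random() - 1.0   # uniform on [-1, 1)
        u2 = 2.0 * rng.random() - 1.0
        x = u1 * u1 + u2 * u2
        if 0.0 < x <= 1.0:              # accept only points inside the unit disc
            break
    y = math.sqrt(-2.0 * math.log(x) / x)
    return u1 * y, u2 * y

# Draw 20000 pairs and compute the sample mean and variance as a sanity check.
rng = random.Random(12345)
samples = [z for _ in range(20000) for z in marsaglia_bray(rng)]
mean = sum(samples) / len(samples)
var = sum((z - mean) ** 2 for z in samples) / len(samples)
```

The number of uniforms consumed per pair is random, which is the feature that makes the method awkward for quasi-Monte Carlo.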


Approximating the Inverse Normal 


Applying the inverse transform method to the normal distribution entails
evaluation of $\Phi^{-1}$. At first sight, this may seem infeasible. However,
there is really no reason to consider $\Phi^{-1}$ any less tractable than, e.g.,
a logarithm. Neither can be computed exactly in general, but both can be
approximated with sufficient accuracy for applications. We discuss some
specific methods for evaluating $\Phi^{-1}$.

Because of the symmetry of the normal distribution,

\[ \Phi^{-1}(1-u) = -\Phi^{-1}(u), \quad 0 < u < 1; \]

it therefore suffices to approximate $\Phi^{-1}$ on the interval $[0.5, 1)$ (or
the interval $(0, 0.5]$) and then to use the symmetry property to extend the
approximation to the rest of the unit interval. Beasley and Springer [43]
provide a rational approximation

\[ \Phi^{-1}(u) \approx
   \frac{\sum_{n=0}^{3} a_n \left(u - \tfrac{1}{2}\right)^{2n+1}}
        {1 + \sum_{n=0}^{3} b_n \left(u - \tfrac{1}{2}\right)^{2n+2}},
   \tag{2.27} \]

for $0.5 \le u \le 0.92$, with constants $a_n$, $b_n$ given in Figure 2.12 (the
powers are those evaluated by the algorithm in Figure 2.13); for $u > 0.92$ they
use a rational function of $\sqrt{\log(1-u)}$. Moro [271] reports greater
accuracy in the tails by replacing the second part of the Beasley-Springer
approximation with a Chebyshev approximation

\[ \Phi^{-1}(u) \approx g(u) = \sum_{n=0}^{8} c_n [\log(-\log(1-u))]^n,
   \quad 0.92 \le u < 1, \tag{2.28} \]

with constants $c_n$ again given in Figure 2.12. Using the symmetry rule, this
gives

\[ \Phi^{-1}(u) \approx -g(1-u), \quad 0 < u \le 0.08. \]

With this modification, Moro [271] finds a maximum absolute error of
$3 \times 10^{-9}$ out to seven standard deviations (i.e., over the range
$\Phi(-7) \le u \le \Phi(7)$). The combined algorithm from Moro [271] is given in
Figure 2.13.

a_0 =   2.50662823884      b_0 =  -8.47351093090
a_1 = -18.61500062529      b_1 =  23.08336743743
a_2 =  41.39119773534      b_2 = -21.06224101826
a_3 = -25.44106049637      b_3 =   3.13082909833

c_0 = 0.3374754822726147   c_5 = 0.0003951896511919
c_1 = 0.9761690190917186   c_6 = 0.0000321767881768
c_2 = 0.1607979714918209   c_7 = 0.0000002888167364
c_3 = 0.0276438810333863   c_8 = 0.0000003960315187
c_4 = 0.0038405729373609


Fig. 2.12. Constants for approximations to inverse normal. 


Input: u between 0 and 1
Output: x, approximation to Φ^{-1}(u).
y ← u − 0.5
if |y| < 0.42
    r ← y * y
    x ← y * (((a_3 * r + a_2) * r + a_1) * r + a_0) /
            ((((b_3 * r + b_2) * r + b_1) * r + b_0) * r + 1)
else
    r ← u;
    if (y > 0) r ← 1 − u
    r ← log(−log(r))
    x ← c_0 + r * (c_1 + r * (c_2 + r * (c_3 + r * (c_4 +
            r * (c_5 + r * (c_6 + r * (c_7 + r * c_8)))))))
    if (y < 0) x ← −x
return x


Fig. 2.13. Beasley-Springer-Moro algorithm for approximating the inverse normal. 
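
The algorithm of Figure 2.13 translates almost line by line into Python. The sketch below is illustrative only: the function name `inverse_normal_bsm` and the tuple names `A`, `B`, `C` are choices made here, and the leading constant is taken as 2.50662823884 (approximately the square root of 2π), which is the value assumed for a_0 in this sketch.

```python
import math

A = (2.50662823884, -18.61500062529, 41.39119773534, -25.44106049637)
B = (-8.47351093090, 23.08336743743, -21.06224101826, 3.13082909833)
C = (0.3374754822726147, 0.9761690190917186, 0.1607979714918209,
     0.0276438810333863, 0.0038405729373609, 0.0003951896511919,
     0.0000321767881768, 0.0000002888167364, 0.0000003960315187)

def inverse_normal_bsm(u):
    """Approximate the inverse standard normal cdf at u in (0, 1)."""
    y = u - 0.5
    if abs(y) < 0.42:
        # Central region: rational function in y, evaluated by Horner's rule.
        r = y * y
        num = y * (((A[3] * r + A[2]) * r + A[1]) * r + A[0])
        den = (((B[3] * r + B[2]) * r + B[1]) * r + B[0]) * r + 1.0
        return num / den
    # Tails: polynomial in log(-log(min(u, 1 - u))).
    r = 1.0 - u if y > 0 else u
    r = math.log(-math.log(r))
    x = 0.0
    for c in reversed(C):            # Horner evaluation of sum_n c_n r^n
        x = c + r * x
    return -x if y < 0 else x
```

For example, `inverse_normal_bsm(0.975)` should be close to the familiar 97.5% quantile of about 1.96.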


The problem of computing $\Phi^{-1}(u)$ can be posed as one of finding the root
$x$ of the equation $\Phi(x) = u$ and in principle addressed through any general
root-finding algorithm. Newton's method, for example, produces the iterates

\[ x_{n+1} = x_n - \frac{\Phi(x_n) - u}{\phi(x_n)}, \]

or, more explicitly (since $1/\phi(x) = \exp(0.5x^2 + c)$),

\[ x_{n+1} = x_n + (u - \Phi(x_n)) \exp(0.5\, x_n \cdot x_n + c),
   \quad c \equiv \log(\sqrt{2\pi}). \]

Marsaglia, Zaman, and Marsaglia [251] recommend the starting point

\[ x_0 = \pm\sqrt{\left| -1.6 \log\left(1.0004 - (1 - 2u)^2\right) \right|}, \]

the sign depending on whether $u \ge 1/2$ or $u < 1/2$. This starting point gives
a surprisingly good approximation to $\Phi^{-1}(u)$. A root-finding procedure is
useful when extreme precision is more important than speed, for example in
tabulating "exact" values or evaluating approximations. Also, a small number
of Newton steps can be appended to an approximation like the one in
Figure 2.13 to further improve accuracy. Adding just a single step to Moro's
[271] algorithm appears to reduce the maximum error to the order of $10^{-15}$.
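
As a rough illustration of the Newton iteration with the starting point above, the following Python sketch evaluates $\Phi$ through the standard library's `math.erfc` (an assumption of this example; any accurate approximation to $\Phi$ could be substituted). The names `normal_cdf` and `inverse_normal_newton`, the tolerance, and the iteration cap are all hypothetical choices.

```python
import math

LOG_SQRT_2PI = 0.918938533204672   # c = log(sqrt(2*pi))

def normal_cdf(x):
    # Phi(x) via the complementary error function.
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def inverse_normal_newton(u, tol=1e-14, max_iter=50):
    """Solve Phi(x) = u for u in (0, 1) by Newton's method."""
    sign = 1.0 if u >= 0.5 else -1.0
    x = sign * math.sqrt(abs(-1.6 * math.log(1.0004 - (1.0 - 2.0 * u) ** 2)))
    for _ in range(max_iter):
        # Newton step: (u - Phi(x)) / phi(x), with 1/phi(x) = exp(x^2/2 + c).
        step = (u - normal_cdf(x)) * math.exp(0.5 * x * x + LOG_SQRT_2PI)
        x += step
        if abs(step) < tol:
            break
    return x
```

In practice only a few iterations are needed because the starting point is already close to the root.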


Approximating the Cumulative Normal 


Of course, the application of Newton's method presupposes the ability to
evaluate $\Phi$ itself quickly and accurately. Evaluation of the cumulative
normal is necessary for many financial applications (including evaluation of
the Black-Scholes formula), so we include methods for approximating this
function. We present two methods; the first is faster and the second is more
accurate, but both are probably fast enough and accurate enough for most
applications.

The first method, based on work of Hastings [171], is one of several included
in Abramowitz and Stegun [3]. For $x \ge 0$, it takes the form

\[ \Phi(x) \approx 1 - \phi(x)(b_1 t + b_2 t^2 + b_3 t^3 + b_4 t^4 + b_5 t^5),
   \quad t = \frac{1}{1 + px}, \]

for constants $b_i$ and $p$. The approximation extends to negative arguments
through the identity $\Phi(-x) = 1 - \Phi(x)$. The necessary constants and an
explicit algorithm for this approximation are given in Figure 2.14. According
to Hastings [171, p.169], this method has a maximum absolute error less than
$7.5 \times 10^{-8}$.


b_1 =  0.319381530     p = 0.2316419
b_2 = −0.356563782     c = log(√(2π)) = 0.918938533204672
b_3 =  1.781477937
b_4 = −1.821255978
b_5 =  1.330274429

Input: x
Output: y, approximation to Φ(x)
a ← |x|
t ← 1/(1 + a * p)
s ← ((((b_5 * t + b_4) * t + b_3) * t + b_2) * t + b_1) * t
y ← s * exp(−0.5 * x * x − c)
if (x > 0) y ← 1 − y
return y;


Fig. 2.14. Hastings’ [171] approximation to the cumulative normal distribution as 
modified in Abramowitz and Stegun [3]. 
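
A direct Python transcription of Figure 2.14 might look as follows. The function name `normal_cdf_hastings` and the tuple layout of the constants are assumptions of this sketch.

```python
import math

B = (0.319381530, -0.356563782, 1.781477937, -1.821255978, 1.330274429)
P = 0.2316419
C = 0.918938533204672   # log(sqrt(2*pi))

def normal_cdf_hastings(x):
    """Approximate Phi(x) with the five-term polynomial in t = 1/(1 + p|x|)."""
    a = abs(x)
    t = 1.0 / (1.0 + a * P)
    s = ((((B[4] * t + B[3]) * t + B[2]) * t + B[1]) * t + B[0]) * t
    y = s * math.exp(-0.5 * x * x - C)   # approximates the tail 1 - Phi(|x|)
    return 1.0 - y if x > 0 else y
```

Comparing against `0.5 * math.erfc(-x / math.sqrt(2))` on a grid is a simple way to confirm that the error stays within the stated bound.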


The second method we include is from Marsaglia et al. [251]. Like the
Hastings approximation above, this method is based on approximating the
ratio $(1-\Phi(x))/\phi(x)$. According to Marsaglia et al. [251], as an
approximation to the tail probability $1 - \Phi(x)$ this method has a maximum
relative error of $10^{-15}$ for $0 \le x \le 6.23025$ and $10^{-12}$ for larger
$x$. (Relative error is much more stringent than absolute error in this
setting; a small absolute error is easily achieved for large $x$ using the
approximation $1 - \Phi(x) \approx 0$.) This method takes about three times as
long as the Hastings approximation, but both methods are very fast. The
complete algorithm appears in Figure 2.15.


v_1 = 1.253314137315500    v_9  = 0.1231319632579329
v_2 = 0.6556795424187985   v_10 = 0.1097872825783083
v_3 = 0.4213692292880545   v_11 = 0.09902859647173193
v_4 = 0.3045902987101033   v_12 = 0.09017567550106468
v_5 = 0.2366523829135607   v_13 = 0.08276628650136917
v_6 = 0.1928081047153158   v_14 = 0.0764757610162485
v_7 = 0.1623776608968675   v_15 = 0.07106958053885211
v_8 = 0.1401041834530502

c = log(√(2π)) = 0.918938533204672

Input: x between -15 and 15
Output: y, approximation to Φ(x).
j ← ⌊min(|x| + 0.5, 14)⌋
z ← j, h ← |x| − z, a ← v_{j+1}
b ← z * a − 1, q ← 1, s ← a + h * b
for i = 2, 4, 6, ..., 24 − j
    a ← (a + z * b)/i
    b ← (b + z * a)/(i + 1)
    q ← q * h * h
    s ← s + q * (a + h * b)
end
y = s * exp(−0.5 * x * x − c)
if (x > 0) y ← 1 − y
return y


Fig. 2.15. Algorithm of Marsaglia et al. [251] to approximate the cumulative normal 
distribution. 
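
The following Python sketch follows Figure 2.15 step by step. It is offered as an illustration under a few assumptions: the function name `normal_cdf_mzm` is hypothetical, the tuple `V` is 0-indexed (so `V[j]` holds $v_{j+1}$), and when $24 - j$ is odd the loop is taken to stop at the last even index not exceeding it.

```python
import math

V = (1.253314137315500, 0.6556795424187985, 0.4213692292880545,
     0.3045902987101033, 0.2366523829135607, 0.1928081047153158,
     0.1623776608968675, 0.1401041834530502, 0.1231319632579329,
     0.1097872825783083, 0.09902859647173193, 0.09017567550106468,
     0.08276628650136917, 0.0764757610162485, 0.07106958053885211)
C = 0.918938533204672   # log(sqrt(2*pi))

def normal_cdf_mzm(x):
    """Approximate Phi(x) by a Taylor expansion of (1 - Phi)/phi about an integer."""
    ax = abs(x)
    j = int(math.floor(min(ax + 0.5, 14.0)))
    z = float(j)
    h = ax - z                 # offset from the expansion point, |h| <= 0.5 for |x| <= 14.5
    a = V[j]                   # value of the ratio at z
    b = z * a - 1.0            # first derivative of the ratio at z
    q = 1.0
    s = a + h * b
    for i in range(2, 24 - j + 1, 2):
        a = (a + z * b) / i
        b = (b + z * a) / (i + 1)
        q = q * h * h
        s += q * (a + h * b)
    y = s * math.exp(-0.5 * x * x - C)   # tail probability 1 - Phi(|x|)
    return 1.0 - y if x > 0 else y
```

The recursion for `a` and `b` generates successive Taylor coefficients, so no further tabulated constants are needed beyond the fifteen values in `V`.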


Marsaglia et al. [251] present a faster approximation achieving similar
accuracy but requiring 121 tabulated constants. Marsaglia et al. also detail
the use of accurate approximations to $\Phi$ in constructing approximations to
$\Phi^{-1}$ by tabulating "exact" values at a large number of strategically
chosen points. Their method entails the use of more than 2000 tabulated
constants, but the constants can be computed rather than tabulated, given an
accurate approximation to $\Phi$.

Other methods for approximating $\Phi$ and $\Phi^{-1}$ found in the literature
are often based on the error function

\[ \text{Erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x e^{-t^2}\, dt \]

and its inverse. Observe that for $x \ge 0$,

\[ \text{Erf}(x) = 2\Phi(x\sqrt{2}) - 1, \quad
   \Phi(x) = \tfrac{1}{2}\left[\text{Erf}(x/\sqrt{2}) + 1\right] \]

and

\[ \text{Erf}^{-1}(u) = \frac{1}{\sqrt{2}}\, \Phi^{-1}\left(\frac{u+1}{2}\right), \quad
   \Phi^{-1}(u) = \sqrt{2}\, \text{Erf}^{-1}(2u - 1), \]

so approximations to Erf and its inverse are easily converted into
approximations to $\Phi$ and its inverse. Hastings [171], in fact, approximates
Erf, so the constants in Figure 2.14 (as modified in [3]) differ from his, with
$p$ smaller and the $b_i$ larger by a factor of $\sqrt{2}$.

Devroye [95] discusses several other methods for sampling from the normal
distribution, including some that may be substantially faster than evaluation
of $\Phi^{-1}$. Nevertheless, as discussed in Section 2.2.1, the inverse
transform method has some advantages, particularly in the application of
variance reduction techniques and low-discrepancy methods, that will often
justify the additional computational effort. One advantage is that the inverse
transform method requires just one uniform input per normal output: a relevant
notion of the dimension of a Monte Carlo problem is often the maximum number
of uniforms required to generate one sample path, so methods requiring more
uniforms per normal sample implicitly result in higher dimensional
representations. Another useful property of the inverse transform method is
that the mapping $u \mapsto \Phi^{-1}(u)$ is both continuous and monotone. These
properties can sometimes enhance the effectiveness of variance reduction
techniques, as we will see in later sections.


2.3.3 Generating Multivariate Normals 


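A multivariate normal distribution $N(\mu, \Sigma)$ is specified by its mean
vector $\mu$ and covariance matrix $\Sigma$. The covariance matrix may be
specified implicitly through its diagonal entries $\sigma_i^2$ and correlations
$\rho_{ij}$ using (2.22); in matrix form,

\[ \Sigma =
   \begin{pmatrix} \sigma_1 & & & \\ & \sigma_2 & & \\ & & \ddots & \\ & & & \sigma_d \end{pmatrix}
   \begin{pmatrix} \rho_{11} & \rho_{12} & \cdots & \rho_{1d} \\
                   \rho_{12} & \rho_{22} & & \rho_{2d} \\
                   \vdots & & \ddots & \vdots \\
                   \rho_{1d} & \rho_{2d} & \cdots & \rho_{dd} \end{pmatrix}
   \begin{pmatrix} \sigma_1 & & & \\ & \sigma_2 & & \\ & & \ddots & \\ & & & \sigma_d \end{pmatrix}. \]

From the Linear Transformation Property (2.23), we know that if
$Z \sim N(0, I)$ and $X = \mu + AZ$, then $X \sim N(\mu, AA^\top)$. Using any of
the methods discussed in Section 2.3.2, we can generate independent standard
normal random variables $Z_1, \ldots, Z_d$ and assemble them into a vector
$Z \sim N(0, I)$. Thus, the problem of sampling $X$ from the multivariate normal
$N(\mu, \Sigma)$ reduces to finding a matrix $A$ for which $AA^\top = \Sigma$.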

Cholesky Factorization


Among all such $A$, a lower triangular one is particularly convenient because
it reduces the calculation of $\mu + AZ$ to the following:

\begin{align*}
X_1 &= \mu_1 + A_{11} Z_1 \\
X_2 &= \mu_2 + A_{21} Z_1 + A_{22} Z_2 \\
    &\ \ \vdots \\
X_d &= \mu_d + A_{d1} Z_1 + A_{d2} Z_2 + \cdots + A_{dd} Z_d.
\end{align*}

A full multiplication of the vector $Z$ by the matrix $A$ would require
approximately twice as many multiplications and additions. A representation of
$\Sigma$ as $AA^\top$ with $A$ lower triangular is a Cholesky factorization of
$\Sigma$. If $\Sigma$ is positive definite (as opposed to merely positive
semidefinite), it has a Cholesky factorization and the matrix $A$ is unique up
to changes in sign.

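Consider a $2 \times 2$ covariance matrix $\Sigma$, represented as

\[ \Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_1\sigma_2\rho \\
                            \sigma_1\sigma_2\rho & \sigma_2^2 \end{pmatrix}. \]

Assuming $\sigma_1 > 0$ and $\sigma_2 > 0$, the Cholesky factor is

\[ A = \begin{pmatrix} \sigma_1 & 0 \\
                       \rho\sigma_2 & \sqrt{1-\rho^2}\,\sigma_2 \end{pmatrix}, \]

as is easily verified by evaluating $AA^\top$. Thus, we can sample from a
bivariate normal distribution $N(\mu, \Sigma)$ by setting

\begin{align*}
X_1 &= \mu_1 + \sigma_1 Z_1 \\
X_2 &= \mu_2 + \sigma_2 \rho Z_1 + \sigma_2 \sqrt{1-\rho^2}\, Z_2,
\end{align*}

with $Z_1$, $Z_2$ independent standard normals.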
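
The bivariate recipe can be checked numerically with a few lines of Python. In this sketch the helper names `cholesky_2x2` and `bivariate_normal` and the parameter values are hypothetical; the standard normals are passed in as arguments so that the output can be verified by hand.

```python
import math

def cholesky_2x2(sigma1, sigma2, rho):
    """Lower triangular factor of the 2 x 2 covariance matrix."""
    return [[sigma1, 0.0],
            [rho * sigma2, math.sqrt(1.0 - rho * rho) * sigma2]]

def bivariate_normal(mu1, mu2, sigma1, sigma2, rho, z1, z2):
    """Map independent standard normals (z1, z2) to a correlated pair (x1, x2)."""
    x1 = mu1 + sigma1 * z1
    x2 = mu2 + sigma2 * rho * z1 + sigma2 * math.sqrt(1.0 - rho * rho) * z2
    return x1, x2

# Verify A A^T against the covariance matrix for sigma1 = 2, sigma2 = 3, rho = 0.6.
A = cholesky_2x2(2.0, 3.0, 0.6)
AAT = [[sum(A[i][k] * A[j][k] for k in range(2)) for j in range(2)] for i in range(2)]
```

Here `AAT` should reproduce the entries $\sigma_1^2 = 4$, $\sigma_1\sigma_2\rho = 3.6$, and $\sigma_2^2 = 9$ up to rounding.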
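For the case of a $d \times d$ covariance matrix $\Sigma$, we need to solve

\[ \begin{pmatrix} A_{11} & & & \\ A_{21} & A_{22} & & \\ \vdots & \vdots & \ddots & \\
                   A_{d1} & A_{d2} & \cdots & A_{dd} \end{pmatrix}
   \begin{pmatrix} A_{11} & A_{21} & \cdots & A_{d1} \\ & A_{22} & \cdots & A_{d2} \\
                   & & \ddots & \vdots \\ & & & A_{dd} \end{pmatrix} = \Sigma. \]

Traversing the $\Sigma_{ij}$ by looping over $j = 1, \ldots, d$ and then
$i = j, \ldots, d$ produces the equations

\begin{align*}
A_{11}^2 &= \Sigma_{11} \\
A_{21} A_{11} &= \Sigma_{21} \\
 &\ \ \vdots \\
A_{d1} A_{11} &= \Sigma_{d1} \tag{2.29} \\
A_{21}^2 + A_{22}^2 &= \Sigma_{22} \\
 &\ \ \vdots \\
A_{d1}^2 + \cdots + A_{dd}^2 &= \Sigma_{dd}.
\end{align*}

Exactly one new entry of the $A$ matrix appears in each equation, making it
possible to solve for the individual entries sequentially.

More compactly, from the basic identity

\[ \Sigma_{ij} = \sum_{k=1}^{j} A_{ik} A_{jk}, \quad j \le i, \]

we get

\[ A_{ij} = \left( \Sigma_{ij} - \sum_{k=1}^{j-1} A_{ik} A_{jk} \right) \Big/ A_{jj},
   \quad j < i, \tag{2.30} \]

and

\[ A_{ii} = \sqrt{ \Sigma_{ii} - \sum_{k=1}^{i-1} A_{ik}^2 }. \tag{2.31} \]

These expressions make possible a simple recursion to find the Cholesky factor.
Figure 2.16 displays an algorithm based on one in Golub and Van Loan [162].
Golub and Van Loan [162] give several other versions of the algorithm and
also discuss numerical stability.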


Input: Symmetric positive definite d × d matrix Σ
Output: Lower triangular A with AA^⊤ = Σ

A ← 0 (d × d zero matrix)
for j = 1, ..., d
    for i = j, ..., d
        v_i ← Σ_ij
        for k = 1, ..., j − 1
            v_i ← v_i − A_jk A_ik
        A_ij ← v_i / √v_j

return A


Fig. 2.16. Cholesky factorization. 
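
A plain-Python rendering of the recursion in Figure 2.16 is sketched below, using nested lists for matrices. The function name `cholesky` is an assumption, and the sketch presumes a positive definite input (no check is made); in production code a library routine would normally be preferred.

```python
import math

def cholesky(sigma):
    """Return lower triangular A (list of lists) with A A^T = sigma."""
    d = len(sigma)
    a = [[0.0] * d for _ in range(d)]
    for j in range(d):
        v = [0.0] * d
        for i in range(j, d):
            v[i] = sigma[i][j]
            for k in range(j):
                v[i] -= a[j][k] * a[i][k]
        root = math.sqrt(v[j])          # v[j] > 0 when sigma is positive definite
        for i in range(j, d):
            a[i][j] = v[i] / root
    return a

A = cholesky([[4.0, 2.0], [2.0, 3.0]])
```

For this input the factor works out by hand to rows (2, 0) and (1, √2).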

The Semidefinite Case


If $\Sigma$ is positive definite, an induction argument verifies that the
quantity inside the square root in (2.31) is strictly positive so the $A_{ii}$
are nonzero. This ensures that (2.30) does not entail division by zero and that
the algorithm in Figure 2.16 runs to completion.

If, however, $\Sigma$ is merely positive semidefinite, then it is rank
deficient. It follows that any matrix $A$ satisfying $AA^\top = \Sigma$ must also
be rank deficient; for if $A$ had full rank, then $\Sigma$ would too. If $A$ is
lower triangular and rank deficient, at least one element of the diagonal of
$A$ must be zero. (The determinant of a triangular matrix is the product of its
diagonal elements, and the determinant of $A$ is zero if $A$ is singular.) Thus,
for semidefinite $\Sigma$, any attempt at Cholesky factorization must produce
some $A_{jj} = 0$ and thus a division by zero in (2.30) and in the algorithm of
Figure 2.16.

From a purely mathematical perspective, the problem is easily solved by
making the $j$th column of $A$ identically zero if $A_{jj} = 0$. This can be
deduced from the system of equations (2.29): the first element of the $j$th
column of $A$ encountered in this sequence of equations is the diagonal entry;
if $A_{jj} = 0$, all subsequent equations for the $j$th column of $\Sigma$ may be
solved with $A_{ij} = 0$. In the factorization algorithm of Figure 2.16, this is
accomplished by inserting "if $v_j > 0$" before the statement
"$A_{ij} \leftarrow v_i/\sqrt{v_j}$." Thus, if $v_j = 0$, the entry $A_{ij}$ is
left at its initial value of zero.

In practice, this solution may be problematic because it involves checking
whether an intermediate calculation ($v_j$) is exactly zero, making the modified
algorithm extremely sensitive to round-off error.
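
To make the modification concrete, here is a hedged Python sketch of the factorization with a guard on the pivot. The function name `cholesky_semidefinite` and, in particular, the small tolerance `tol` used in place of an exact comparison with zero are assumptions of this illustration and are not part of the algorithm as stated; a tolerance softens but does not remove the round-off sensitivity just described.

```python
import math

def cholesky_semidefinite(sigma, tol=1e-12):
    """Factor a positive semidefinite matrix, zeroing columns with a vanishing pivot."""
    d = len(sigma)
    a = [[0.0] * d for _ in range(d)]
    for j in range(d):
        # v[0] is the pivot v_j; v[i - j] corresponds to v_i for i >= j.
        v = [sigma[i][j] - sum(a[j][k] * a[i][k] for k in range(j))
             for i in range(j, d)]
        if v[0] > tol:                      # assumed tolerance instead of "v_j > 0"
            root = math.sqrt(v[0])
            for offset, value in enumerate(v):
                a[j + offset][j] = value / root
        # otherwise column j is left identically zero
    return a

# Rank-one example: both components are driven by a single factor.
A = cholesky_semidefinite([[1.0, 1.0], [1.0, 1.0]])
```

In this example the second column of `A` stays at zero and the product of `A` with its transpose still reproduces the input matrix.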

Rather than blindly subjecting a singular covariance matrix to Cholesky
factorization, it is therefore preferable to use the structure of the
covariance matrix to reduce the problem to one of full rank. If
$X \sim N(0, \Sigma)$ and the $d \times d$ matrix $\Sigma$ has rank $k < d$, it
is possible to express all $d$ components of $X$ as linear combinations of just
$k$ of the components, these $k$ components having a covariance matrix of rank
$k$. In other words, it is possible to find a subvector
$\tilde{X} = (X_{i_1}, \ldots, X_{i_k})$ and a $d \times k$ matrix $D$ such that
$D\tilde{X} \sim N(0, \Sigma)$ and for which the covariance matrix
$\tilde{\Sigma}$ of $\tilde{X}$ has full rank $k$. Cholesky factorization can
then be applied to $\tilde{\Sigma}$ to find $\tilde{A}$ satisfying
$\tilde{A}\tilde{A}^\top = \tilde{\Sigma}$. The full vector $X$ can be sampled by
setting $X = D\tilde{A}Z$, $Z \sim N(0, I)$.

Singular covariance matrices often arise from factor models in which a
vector of length $d$ is determined by $k < d$ sources of uncertainty (factors).
In this case, the prescription above reduces to using knowledge of the factor
structure to generate $X$.


Eigenvector Factorization and Principal Components 


The equation $AA^\top = \Sigma$ can also be solved by diagonalizing $\Sigma$. As
a symmetric $d \times d$ matrix, $\Sigma$ has $d$ real eigenvalues
$\lambda_1, \ldots, \lambda_d$, and because $\Sigma$ must be positive definite or
semidefinite the $\lambda_i$ are nonnegative. Furthermore, $\Sigma$ has an
associated orthonormal set of eigenvectors $\{v_1, \ldots, v_d\}$; i.e., vectors
satisfying

\[ v_i^\top v_i = 1, \quad v_i^\top v_j = 0, \quad j \ne i, \quad i, j = 1, \ldots, d, \]

and

\[ \Sigma v_i = \lambda_i v_i. \]

It follows that $\Sigma = V \Lambda V^\top$, where $V$ is the orthogonal matrix
($VV^\top = I$) with columns $v_1, \ldots, v_d$ and $\Lambda$ is the diagonal
matrix with diagonal entries $\lambda_1, \ldots, \lambda_d$. Hence, if we choose

\[ A = V \Lambda^{1/2} = V
   \begin{pmatrix} \sqrt{\lambda_1} & & & \\ & \sqrt{\lambda_2} & & \\
                   & & \ddots & \\ & & & \sqrt{\lambda_d} \end{pmatrix},
   \tag{2.32} \]

then

\[ AA^\top = V \Lambda V^\top = \Sigma. \]

Methods for calculating $V$ and $\Lambda$ are included in many mathematical
software libraries and discussed in detail in Golub and Van Loan [162].

Unlike the Cholesky factor, the matrix $A$ in (2.32) has no particular
structure providing a computational advantage in evaluating $AZ$, nor is this
matrix faster to compute than the Cholesky factorization. The eigenvectors and
eigenvalues of a covariance matrix do however have a statistical
interpretation that is occasionally useful. We discuss this interpretation
next.

If $X \sim N(0, \Sigma)$ and $Z \sim N(0, I)$, then generating $X$ as $AZ$ for
any choice of $A$ means setting

\[ X = a_1 Z_1 + a_2 Z_2 + \cdots + a_d Z_d, \]

where $a_j$ is the $j$th column of $A$. We may interpret the $Z_j$ as
independent factors driving the components of $X$, with $A_{ij}$ the "factor
loading" of $Z_j$ on $X_i$. If $\Sigma$ has rank 1, then $X$ may be represented
as $a_1 Z_1$ for some vector $a_1$, and in this case a single factor suffices to
represent $X$. If $\Sigma$ has rank $k$, then $k$ factors $Z_1, \ldots, Z_k$
suffice.

If $\Sigma$ has full rank and $AA^\top = \Sigma$, then $A$ must have full rank
and $X = AZ$ implies $Z = BX$ with $B = A^{-1}$. Thus, the factors $Z_j$ are
themselves linear combinations of the $X_i$. In the special case of $A$ given in
(2.32), we have

\[ A^{-1} = \Lambda^{-1/2} V^\top \tag{2.33} \]

because $V^\top V = I$ ($V$ is orthogonal). It follows that $Z_j$ is proportional
to $v_j^\top X$, where $v_j$ is the $j$th column of $V$ and thus an eigenvector
of $\Sigma$.

The factors $Z_j$ constructed proportional to the $v_j^\top X$ are optimal in a
precise sense. Suppose we want to find the best single-factor approximation
to $X$; i.e., the linear combination $w^\top X$ that best captures the
variability of the components of $X$. A standard notion of optimality chooses
$w$ to maximize the variance of $w^\top X$, which is given by $w^\top \Sigma w$.
Since this variance can be made arbitrarily large by multiplying any $w$ by a
constant, it makes sense to impose a normalization through a constraint of the
form $w^\top w = 1$. We are thus led to the problem

\[ \max_{w:\, w^\top w = 1} w^\top \Sigma w. \]

If the eigenvalues of $\Sigma$ are ordered so that

\[ \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d, \]

then this optimization problem is solved by $v_1$, as is easily verified by
appending the constraint with a Lagrange multiplier and differentiating. (This
optimality property of eigenvectors is sometimes called Rayleigh's principle.)
The problem of finding the next best factor orthogonal to the first reduces to
solving

\[ \max_{w:\, w^\top w = 1,\ w^\top v_1 = 0} w^\top \Sigma w. \]

This optimization problem is solved by $v_2$. More generally, the best
$k$-factor approximation chooses factors proportional to
$v_1^\top X, v_2^\top X, \ldots, v_k^\top X$. Since

\[ v_j^\top \Sigma v_j = \lambda_j, \]

normalizing the $v_j^\top X$ to construct unit-variance factors yields

\[ Z_j = \frac{1}{\sqrt{\lambda_j}}\, v_j^\top X, \]

which coincides with (2.33). The transformation $X = AZ$ recovering $X$ from
the $Z_j$ is precisely the $A$ in (2.32).

The optimality of this representation can be recast in the following way.
Suppose that we are given $X$ and that we want to find vectors
$a_1, \ldots, a_k$ in $\mathbb{R}^d$ and unit-variance random variables
$Z_1, \ldots, Z_k$ in order to approximate $X$ by $a_1 Z_1 + \cdots + a_k Z_k$.
For any $k = 1, \ldots, d$, the mean square approximation error

\[ E\left[ \left\| X - \sum_{i=1}^{k} a_i Z_i \right\|^2 \right],
   \quad (\|x\|^2 = x^\top x) \]

is minimized by taking the $a_i$ to be the columns of $A$ in (2.32) and setting
$Z_i = v_i^\top X / \sqrt{\lambda_i}$.

In the statistical literature, the linear combinations $v_j^\top X$ are called
the principal components of $X$ (see, e.g., Seber [325]). We may thus say that
the principal components provide an optimal lower-dimensional approximation to
a random vector. The variance explained by the first $k$ principal components
is the ratio

\[ \frac{\lambda_1 + \cdots + \lambda_k}
        {\lambda_1 + \cdots + \lambda_k + \cdots + \lambda_d}; \tag{2.34} \]

in particular, the first principal component is chosen to explain as much
variance as possible. In simulation applications, generating $X$ from its
principal components (i.e., using (2.32)) is sometimes useful in designing
variance reduction techniques. In some cases, the principal components
interpretation suggests that variance reduction should focus first on $Z_1$,
then on $Z_2$, and so on. We will see examples of this in Chapter 4 and related
ideas in Section 5.5.
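
As a small illustration of the first principal component, the Python sketch below approximates the largest eigenvalue and its eigenvector by power iteration, then forms the corresponding column of $A$ in (2.32) and the ratio (2.34) with $k = 1$. Power iteration is a technique swapped in here for simplicity and is not a method discussed in the text (library eigenvalue routines would normally be used); the function name `leading_eigenpair`, the starting vector, and the iteration count are all assumptions.

```python
import math

def leading_eigenpair(sigma, iterations=500):
    """Approximate the largest eigenvalue and unit eigenvector of a symmetric PSD matrix."""
    d = len(sigma)
    w = [1.0 / (i + 1) for i in range(d)]   # arbitrary start, assumed not orthogonal to v_1
    for _ in range(iterations):
        w = [sum(sigma[i][j] * w[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        w = [x / norm for x in w]
    # Rayleigh quotient w' Sigma w (w has unit length).
    lam = sum(w[i] * sigma[i][j] * w[j] for i in range(d) for j in range(d))
    return lam, w

sigma = [[2.0, 1.0], [1.0, 2.0]]
lam1, v1 = leading_eigenpair(sigma)
a1 = [math.sqrt(lam1) * x for x in v1]                  # first column of A in (2.32)
explained = lam1 / sum(sigma[i][i] for i in range(2))   # trace equals the sum of eigenvalues
```

For this matrix the eigenvalues are 3 and 1, so the first principal component explains three quarters of the total variance.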


3 
Generating Sample Paths 


This chapter develops methods for simulating paths of a variety of stochastic 
processes important in financial engineering. The emphasis in this chapter is 
on methods for exact simulation of continuous-time processes at a discrete
set of dates. The methods are exact in the sense that the joint distribution of 
the simulated values coincides with the joint distribution of the continuous- 
time process on the simulation time grid. Exact methods rely on special fea- 
tures of a model and are generally available only for models that offer some 
tractability. More complex models must ordinarily be simulated through, e.g., 
discretization of stochastic differential equations, as discussed in Chapter 6. 

The examples covered in this chapter are arranged roughly in increasing or- 
der of complexity. We begin with methods for simulating Brownian motion in 
one dimension or multiple dimensions and extend these to geometric Brownian 
motion. We then consider Gaussian interest rate models. Our first real break 
from Gaussian processes comes in Section 3.4, where we treat square-root dif- 
fusions. Section 3.5 considers processes with jumps as models of asset prices. 
Sections 3.6 and 3.7 treat substantially more complex models than the rest of 
the chapter; these are interest rate models that describe the term structure 
through a curve or vector of forward rates. Exact simulation of these models is 
generally infeasible; we have included them here because of their importance 
in financial engineering and because they illustrate some of the complexities 
of the use of simulation for derivatives pricing. 


3.1 Brownian Motion 


3.1.1 One Dimension 


By a standard one-dimensional Brownian motion on $[0, T]$, we mean a
stochastic process $\{W(t), 0 \le t \le T\}$ with the following properties:

(i) $W(0) = 0$;

(ii) the mapping $t \mapsto W(t)$ is, with probability 1, a continuous function
on $[0, T]$;

(iii) the increments $\{W(t_1) - W(t_0), W(t_2) - W(t_1), \ldots, W(t_k) - W(t_{k-1})\}$
are independent for any $k$ and any $0 \le t_0 < t_1 < \cdots < t_k \le T$;

(iv) $W(t) - W(s) \sim N(0, t - s)$ for any $0 \le s < t \le T$.

In (iv) it would suffice to require that $W(t) - W(s)$ have mean 0 and variance
$t - s$; that its distribution is in fact normal follows from the continuity of
sample paths in (ii) and the independent increments property (iii). We include
the condition of normality in (iv) because it is central to our discussion. A
consequence of (i) and (iv) is that

\[ W(t) \sim N(0, t), \tag{3.1} \]

for $0 < t \le T$.
For constants $\mu$ and $\sigma > 0$, we call a process $X(t)$ a Brownian motion
with drift $\mu$ and diffusion coefficient $\sigma^2$ (abbreviated
$X \sim \text{BM}(\mu, \sigma^2)$) if

\[ \frac{X(t) - \mu t}{\sigma} \]

is a standard Brownian motion. Thus, we may construct $X$ from a standard
Brownian motion $W$ by setting

\[ X(t) = \mu t + \sigma W(t). \]

It follows that $X(t) \sim N(\mu t, \sigma^2 t)$. Moreover, $X$ solves the
stochastic differential equation (SDE)

\[ dX(t) = \mu\, dt + \sigma\, dW(t). \]

The assumption that $X(0) = 0$ is a natural normalization, but we may construct
a Brownian motion with parameters $\mu$ and $\sigma^2$ and initial value $x$ by
simply adding $x$ to each $X(t)$.

For deterministic but time-varying $\mu(t)$ and $\sigma(t) > 0$, we may define a
Brownian motion with drift $\mu$ and diffusion coefficient $\sigma^2$ through
the SDE

\[ dX(t) = \mu(t)\, dt + \sigma(t)\, dW(t); \]

i.e., through

\[ X(t) = X(0) + \int_0^t \mu(s)\, ds + \int_0^t \sigma(s)\, dW(s), \]

with $X(0)$ an arbitrary constant. The process $X$ has continuous sample paths
and independent increments. Each increment $X(t) - X(s)$ is normally
distributed with mean

\[ E[X(t) - X(s)] = \int_s^t \mu(u)\, du \]

and variance

\[ \text{Var}[X(t) - X(s)] = \text{Var}\left[ \int_s^t \sigma(u)\, dW(u) \right]
   = \int_s^t \sigma^2(u)\, du. \]

Random Walk Construction


In discussing the simulation of Brownian motion, we mostly focus on simulating
values $(W(t_1), \ldots, W(t_n))$ or $(X(t_1), \ldots, X(t_n))$ at a fixed set of
points $0 < t_1 < \cdots < t_n$. Because Brownian motion has independent
normally distributed increments, simulating the $W(t_i)$ or $X(t_i)$ from their
increments is straightforward. Let $Z_1, \ldots, Z_n$ be independent standard
normal random variables, generated using any of the methods in Section 2.3.2,
for example. For a standard Brownian motion set $t_0 = 0$ and $W(0) = 0$.
Subsequent values can be generated as follows:

\[ W(t_{i+1}) = W(t_i) + \sqrt{t_{i+1} - t_i}\, Z_{i+1}, \quad i = 0, \ldots, n-1. \tag{3.2} \]

For $X \sim \text{BM}(\mu, \sigma^2)$ with constant $\mu$ and $\sigma$ and given
$X(0)$, set

\[ X(t_{i+1}) = X(t_i) + \mu (t_{i+1} - t_i) + \sigma \sqrt{t_{i+1} - t_i}\, Z_{i+1},
   \quad i = 0, \ldots, n-1. \tag{3.3} \]

With time-dependent coefficients, the recursion becomes

\[ X(t_{i+1}) = X(t_i) + \int_{t_i}^{t_{i+1}} \mu(s)\, ds
   + \sqrt{\int_{t_i}^{t_{i+1}} \sigma^2(u)\, du}\; Z_{i+1},
   \quad i = 0, \ldots, n-1. \tag{3.4} \]

The methods in (3.2)-(3.4) are exact in the sense that the joint distribution
of the simulated values $(W(t_1), \ldots, W(t_n))$ or $(X(t_1), \ldots, X(t_n))$
coincides with the joint distribution of the corresponding Brownian motion at
$t_1, \ldots, t_n$. Of course, this says nothing about what happens between the
$t_i$. One may extend the simulated values to other time points through, e.g.,
piecewise linear interpolation; but no deterministic interpolation method will
give the extended vector the correct joint distribution. The methods in
(3.2)-(3.4) are exact at the time points $t_1, \ldots, t_n$ but subject to
discretization error, compared to a true Brownian motion, if deterministically
interpolated to other time points. Replacing (3.4) with the Euler approximation

\[ X(t_{i+1}) = X(t_i) + \mu(t_i)(t_{i+1} - t_i) + \sigma(t_i)\sqrt{t_{i+1} - t_i}\, Z_{i+1},
   \quad i = 0, \ldots, n-1, \]

will in general introduce discretization error even at $t_1, \ldots, t_n$,
because the increments will no longer have exactly the right mean and
variance. We return to the topic of discretization error in Chapter 6.
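
The recursions (3.2) and (3.3) amount to a running sum of scaled normals. A minimal Python sketch follows; the function name `brownian_path` and its keyword arguments are hypothetical, and the standard normals are supplied by the caller so that the recursion itself can be checked deterministically.

```python
import math

def brownian_path(times, normals, mu=0.0, sigma=1.0, x0=0.0):
    """Values of X ~ BM(mu, sigma^2) at increasing times, following (3.3).

    With mu = 0, sigma = 1, x0 = 0 this reduces to (3.2) for standard Brownian motion.
    """
    path = []
    t_prev, x_prev = 0.0, x0
    for t, z in zip(times, normals):
        dt = t - t_prev
        x_prev = x_prev + mu * dt + sigma * math.sqrt(dt) * z
        path.append(x_prev)
        t_prev = t
    return path

# Two steps of lengths 1 and 3 with both normals equal to 1.
w = brownian_path([1.0, 4.0], [1.0, 1.0])
```

In a simulation the list `normals` would be filled with independent N(0,1) draws from any of the generators of Section 2.3.2.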

The vector $(W(t_1), \ldots, W(t_n))$ is a linear transformation of the vector
of increments $(W(t_1), W(t_2) - W(t_1), \ldots, W(t_n) - W(t_{n-1}))$. Since
these increments are independent and normally distributed, it follows from the
Linear Transformation Property (2.23) that $(W(t_1), \ldots, W(t_n))$ has a
multivariate normal distribution. Simulating $(W(t_1), \ldots, W(t_n))$ is thus
a special case of the general problem, treated in Section 2.3.3, of generating
multivariate normal vectors. While the random walk construction suffices for
most applications, it is interesting and sometimes useful to consider
alternative sampling methods.

To apply any of the methods considered in Section 2.3.3, we first need
to find the mean vector and covariance matrix of $(W(t_1), \ldots, W(t_n))$. For
a standard Brownian motion, we know from (3.1) that $E[W(t_i)] = 0$, so the
mean vector is identically 0. For the covariance matrix, consider first any
$0 < s < t < T$; using the independence of the increments we find that

\begin{align*}
\text{Cov}[W(s), W(t)] &= \text{Cov}[W(s), W(s) + (W(t) - W(s))] \\
 &= \text{Cov}[W(s), W(s)] + \text{Cov}[W(s), W(t) - W(s)] \\
 &= s + 0 = s. \tag{3.5}
\end{align*}

Letting $C$ denote the covariance matrix of $(W(t_1), \ldots, W(t_n))$, we thus
have

\[ C_{ij} = \min(t_i, t_j). \tag{3.6} \]


Cholesky Factorization 


Having noted that the vector $(W(t_1), \ldots, W(t_n))$ has the distribution
$N(0, C)$, with $C$ as in (3.6), we may simulate this vector as $AZ$, where
$Z = (Z_1, \ldots, Z_n)^\top \sim N(0, I)$ and $A$ satisfies $AA^\top = C$. The
Cholesky method discussed in Section 2.3.3 takes $A$ to be lower triangular.
For $C$ in (3.6), the Cholesky factor is given by

\[ A = \begin{pmatrix}
   \sqrt{t_1} & 0 & \cdots & 0 \\
   \sqrt{t_1} & \sqrt{t_2 - t_1} & \cdots & 0 \\
   \vdots & \vdots & \ddots & \vdots \\
   \sqrt{t_1} & \sqrt{t_2 - t_1} & \cdots & \sqrt{t_n - t_{n-1}}
   \end{pmatrix}, \]

as can be verified through calculation of $AA^\top$. In this case, generating
$(W(t_1), \ldots, W(t_n))$ as $AZ$ is simply a matrix-vector representation of
the recursion in (3.2). Put differently, the random walk construction (3.2) may
be viewed as an efficient implementation of the product $AZ$. Even exploiting
the lower triangularity of $A$, evaluation of $AZ$ is an $O(n^2)$ operation; the
random walk construction reduces this to $O(n)$ by implicitly exploiting the
fact that the nonzero entries of each column of $A$ are identical.

For a $\text{BM}(\mu, \sigma^2)$ process $X$, the mean vector of
$(X(t_1), \ldots, X(t_n))$ has $i$th component $\mu t_i$ and the covariance
matrix is $\sigma^2 C$. The Cholesky factor is $\sigma A$ and we once again find
that the Cholesky method coincides with the increment recursion (3.3).
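
The claim that this lower triangular matrix factors $C$ can be checked numerically. The sketch below builds the factor for an arbitrary (hypothetical) time grid and multiplies it by its transpose; the function name `brownian_cholesky` is an assumption of this illustration.

```python
import math

def brownian_cholesky(times):
    """Lower triangular A with A[i][j] = sqrt(t_{j+1} - t_j) for j <= i (0-indexed)."""
    n = len(times)
    grid = [0.0] + list(times)
    return [[math.sqrt(grid[j + 1] - grid[j]) if j <= i else 0.0 for j in range(n)]
            for i in range(n)]

times = [0.5, 1.0, 2.5]
A = brownian_cholesky(times)
# Form A A^T; entry (i, j) should equal min(t_i, t_j) as in (3.6).
C = [[sum(A[i][k] * A[j][k] for k in range(3)) for j in range(3)] for i in range(3)]
```

Row $i$ of the product $AZ$ is the cumulative sum of the first $i$ scaled normals, which is exactly what the recursion (3.2) computes in $O(n)$ operations.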


Brownian Bridge Construction


The recursion (3.2) generates the vector $(W(t_1), \ldots, W(t_n))$ from left to
right. We may however generate the $W(t_i)$ in any order we choose, provided
that at each step we sample from the correct conditional distribution given the
values already generated. For example, we may first generate the final value
$W(t_n)$, then sample $W(t_{\lfloor n/2 \rfloor})$ conditional on the value of
$W(t_n)$, and proceed by progressively filling in intermediate values. This
flexibility can be useful in implementing variance reduction techniques and
low-discrepancy methods. It follows from the Conditioning Formula (2.24) that
the conditional distribution needed at each step is itself normal and this
makes conditional sampling feasible.

Conditioning a Brownian motion on its endpoints produces a Brownian
bridge. Once we determine $W(t_n)$, filling in intermediate values amounts to
simulating a Brownian bridge from $0 = W(0)$ to $W(t_n)$. If we next sample
$W(t_{\lfloor n/2 \rfloor})$, then filling in values between times
$t_{\lfloor n/2 \rfloor}$ and $t_n$ amounts to simulating a Brownian bridge from
$W(t_{\lfloor n/2 \rfloor})$ to $W(t_n)$. This approach is thus referred to as a
Brownian bridge construction.

As a first step in developing this construction, suppose $0 < u < s < t$
and consider the problem of generating $W(s)$ conditional on $W(u) = x$ and
$W(t) = y$. We use the Conditioning Formula (2.24) to find the conditional
distribution of $W(s)$. We know from (3.5) that the unconditional distribution
is given by

\[ \begin{pmatrix} W(u) \\ W(s) \\ W(t) \end{pmatrix} \sim
   N\left( 0, \begin{pmatrix} u & u & u \\ u & s & s \\ u & s & t \end{pmatrix} \right). \]

The Conditioning Formula (2.24) gives the distribution of the second compo- 
nent of a partitioned vector conditional on a value of the first component. We 
want to apply this formula to find the distribution of W (s) conditional on the 
value of (W (u), W(t)). We therefore first permute the entries of the vector to 
get 


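\[ \begin{pmatrix} W(s) \\ W(u) \\ W(t) \end{pmatrix} \sim
   N\left( 0, \begin{pmatrix} s & u & s \\ u & u & u \\ s & u & t \end{pmatrix} \right). \]

We now find from the Conditioning Formula that, given $(W(u) = x, W(t) = y)$,
$W(s)$ is normally distributed with mean

\[ E[W(s) \mid W(u) = x, W(t) = y]
   = 0 + \begin{pmatrix} u & s \end{pmatrix}
     \begin{pmatrix} u & u \\ u & t \end{pmatrix}^{-1}
     \begin{pmatrix} x \\ y \end{pmatrix}
   = \frac{(t-s)x + (s-u)y}{t-u}, \tag{3.7} \]

and variance

\[ s - \begin{pmatrix} u & s \end{pmatrix}
     \begin{pmatrix} u & u \\ u & t \end{pmatrix}^{-1}
     \begin{pmatrix} u \\ s \end{pmatrix}
   = \frac{(s-u)(t-s)}{t-u}, \tag{3.8} \]

since

\[ \begin{pmatrix} u & u \\ u & t \end{pmatrix}^{-1}
   = \frac{1}{t-u} \begin{pmatrix} t/u & -1 \\ -1 & 1 \end{pmatrix}. \]

In particular, the conditional mean (3.7) is obtained by linearly interpolating
between $(u, x)$ and $(t, y)$.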

Suppose, more generally, that the values $W(s_1) = x_1$, $W(s_2) = x_2$, ...,
$W(s_k) = x_k$ of the Brownian path have been determined at the times
$s_1 < s_2 < \cdots < s_k$ and that we wish to sample $W(s)$ conditional on
these values. Suppose that $s_i < s < s_{i+1}$. Then

\[ (W(s) \mid W(s_j) = x_j,\ j = 1, \ldots, k)
   = (W(s) \mid W(s_i) = x_i,\ W(s_{i+1}) = x_{i+1}), \]

in the sense that the two conditional distributions are the same. This can again
be derived from the Conditioning Formula (2.24) but is more immediate from
the Markov property of Brownian motion (a consequence of the independent
increments property): given $W(s_i)$, $W(s)$ is independent of all $W(t)$ with
$t < s_i$, and given $W(s_{i+1})$ it is independent of all $W(t)$ with
$t > s_{i+1}$. Thus, conditioning on all $W(s_j)$ is equivalent to conditioning
on the values of the Brownian path at the two times $s_i$ and $s_{i+1}$ closest
to $s$. Combining these observations with (3.7) and (3.8), we find that

\[ (W(s) \mid W(s_1) = x_1, W(s_2) = x_2, \ldots, W(s_k) = x_k)
   = N\left( \frac{(s_{i+1} - s)x_i + (s - s_i)x_{i+1}}{s_{i+1} - s_i},\
             \frac{(s_{i+1} - s)(s - s_i)}{s_{i+1} - s_i} \right). \]

This is illustrated in Figure 3.1. The conditional mean of $W(s)$ lies on the
line segment connecting $(s_i, x_i)$ and $(s_{i+1}, x_{i+1})$; the actual value
of $W(s)$ is normally distributed about this mean with a variance that depends
on $(s - s_i)$ and $(s_{i+1} - s)$. To sample from this conditional
distribution, we may set

\[ W(s) = \frac{(s_{i+1} - s)x_i + (s - s_i)x_{i+1}}{s_{i+1} - s_i}
   + \sqrt{\frac{(s_{i+1} - s)(s - s_i)}{s_{i+1} - s_i}}\; Z, \]

with $Z \sim N(0, 1)$ independent of all $W(s_1), \ldots, W(s_k)$.
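
The conditional sampling step can be written as a single function. In this Python sketch the name `bridge_sample` and the argument order are hypothetical; the standard normal `z` is passed in so that the conditional mean and standard deviation can be read off directly.

```python
import math

def bridge_sample(s_left, x_left, s_right, x_right, s, z):
    """Sample W(s) given W(s_left) = x_left and W(s_right) = x_right, s_left < s < s_right."""
    span = s_right - s_left
    mean = ((s_right - s) * x_left + (s - s_left) * x_right) / span   # linear interpolation
    std = math.sqrt((s_right - s) * (s - s_left) / span)              # from (3.8)
    return mean + std * z
```

With `z = 0` the function returns the interpolated mean; with pinned values 0 at times 0 and 1, the midpoint has conditional standard deviation 1/2.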

By repeatedly using these observations, we may indeed sample the components
of the vector $(W(t_1), \ldots, W(t_n))$ in any order. In particular, we may
start by sampling $W(t_n)$ from $N(0, t_n)$ and proceed by conditionally
sampling intermediate values, at each step conditioning on the two closest
time points already sampled (possibly including $W(0) = 0$).

If $n$ is a power of 2, the construction can be arranged so that each $W(t_i)$,
$i < n$, is generated conditional on the values $W(t_\ell)$ and $W(t_r)$ with the
property that $i$ is midway between $\ell$ and $r$. Figure 3.2 details this case.
If, for example, $n = 16$, the algorithm starts by sampling $W(t_{16})$; the
first loop over $j$ samples $W(t_8)$; the second samples $W(t_4)$ and
$W(t_{12})$; the third samples $W(t_2)$, $W(t_6)$, $W(t_{10})$, and $W(t_{14})$;
and the final loop fills in all $W(t_i)$ with odd $i$. If $n$ is not a power of
2, the algorithm could still be applied to a subset of $2^m < n$ of the $t_i$,
with the remaining points filled in at the end.

Our discussion of the Brownian bridge construction (and Figure 3.2 in 
rticular) has considered only the case of a standard Brownian motion. How 
uld the construction be modified for a Brownian motion with drift u? Only 


Fig. 3.1. Brownian bridge construction of Brownian path. Conditional on $W(s_i) =
x_i$ and $W(s_{i+1}) = x_{i+1}$, the value at $s$ is normally distributed. The conditional mean
is obtained by linear interpolation between $x_i$ and $x_{i+1}$; the conditional variance is
obtained from (3.8).


Input: Time indices (t_1, ..., t_{2^m})
Output: Path (w_1, ..., w_{2^m}) with distribution of (W(t_1), ..., W(t_{2^m}))


Generate (Z_1, ..., Z_{2^m}) ~ N(0, I)
h ← 2^m, j_max ← 1
w_h ← sqrt(t_h) Z_h
t_0 ← 0, w_0 ← 0
for k = 1, ..., m
    i_min ← h/2, i ← i_min
    ℓ ← 0, r ← h
    for j = 1, ..., j_max
        a ← ((t_r − t_i) w_ℓ + (t_i − t_ℓ) w_r)/(t_r − t_ℓ)
        b ← sqrt((t_i − t_ℓ)(t_r − t_i)/(t_r − t_ℓ))
        w_i ← a + b Z_i
        i ← i + h, ℓ ← ℓ + h, r ← r + h
    end
    j_max ← 2 * j_max;
    h ← i_min;
end
return (w_1, ..., w_{2^m})


Fig. 3.2. Implementation of Brownian bridge construction when the number of
time indices is a power of 2. The conditional standard deviations assigned to b could
be precomputed and stored in an array (b_1, ..., b_{2^m}) if multiple paths are to be
generated. The interpolation weights used in calculating the conditional mean a
could also be precomputed.
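
A minimal Python sketch of the construction in Figure 3.2 follows. It is an illustration rather than a reference implementation: the function name `brownian_bridge_path` is hypothetical, the normals are passed in explicitly so the output is reproducible, and the lists are padded with an index-0 entry so that the code can use the same indices as the figure.

```python
import math


def brownian_bridge_path(times, z):
    """Brownian bridge construction for times = [t_1, ..., t_n], n = 2^m.

    z holds the n standard normals (Z_1, ..., Z_n); returns [w_1, ..., w_n].
    """
    n = len(times)
    if n < 1 or n & (n - 1) or len(z) != n:
        raise ValueError("need 2^m time points and one normal per point")
    m = n.bit_length() - 1
    t = [0.0] + list(times)   # t[0] = 0
    zz = [0.0] + list(z)      # zz[i] pairs with t[i]
    w = [0.0] * (n + 1)       # w[0] = W(0) = 0
    h, j_max = n, 1
    w[h] = math.sqrt(t[h]) * zz[h]          # rightmost point first
    for _ in range(m):
        i_min = h // 2
        i, left, right = i_min, 0, h
        for _ in range(j_max):
            gap = t[right] - t[left]
            a = ((t[right] - t[i]) * w[left] + (t[i] - t[left]) * w[right]) / gap
            b = math.sqrt((t[i] - t[left]) * (t[right] - t[i]) / gap)
            w[i] = a + b * zz[i]            # conditional mean plus noise
            i, left, right = i + h, left + h, right + h
        j_max *= 2
        h = i_min
    return w[1:]
```

Feeding in a vector of normals in which only the last entry is nonzero produces a straight line from 0 to the endpoint, which makes the interpolation structure easy to see.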
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the first step — sampling of the rightmost point —- would change. Instead 
of sampling $W(t_m)$ from $N(0, t_m)$, we would sample it from $N(\mu t_m, t_m)$. The
conditional distribution of $W(t_1), \ldots, W(t_{m-1})$ given $W(t_m)$ is the same for
all values of $\mu$. Put slightly differently, a Brownian bridge constructed from
a Brownian motion with drift has the same law as one constructed from a
standard Brownian motion. (For any finite set of points $t_1, \ldots, t_{m-1}$ this can
be established from the Conditioning Formula (2.24).) Hence, to include a
drift in the algorithm of Figure 3.2, it suffices to change just the third
line, adding $\mu t_h$ to $w_h$. For a Brownian motion with diffusion coefficient $\sigma^2$,
the conditional mean (3.7) is unchanged but the conditional variance (3.8)
is multiplied by $\sigma^2$. This could be implemented in Figure 3.2 by multiplying
each $b$ by $\sigma$ (and setting $w_h \leftarrow \mu t_h + \sigma\sqrt{t_h} Z_h$ in the third line); alternatively,
the final vector $(w_1, \ldots, w_{2^m})$ could simply be multiplied by $\sigma$.

Why use a Brownian bridge construction? The algorithm in Figure 3.2 has 
no computational advantage over the simple random walk recursion (3.2). Nor 
does the output of the algorithm have any statistical feature not shared by
the output of (3.2); indeed, the Brownian bridge construction is valid precisely
because the distribution of the $(W(t_1), \ldots, W(t_m))$ it produces coincides with
that resulting from (3.2). The potential advantage of the Brownian bridge
construction arises when it is used with certain variance reduction techniques
and low-discrepancy methods. We will return to this point in Section 4.3 and
Chapter 5. Briefly, the Brownian bridge construction gives us greater control
over the coarse structure of the simulated Brownian path. For example, it
uses a single normal random variable to determine the endpoint of a path,
which may be the most important feature of the path; in contrast, the endpoint
obtained from (3.2) is the combined result of $n$ independent normal
random variables. The standard recursion (3.2) proceeds by evolving the path
forward through time; in contrast, the Brownian bridge construction proceeds
by adding increasingly fine detail to the path at each step, as illustrated in
Figure 3.3. This can be useful in focusing variance reduction techniques on
“important” features of Brownian paths.


Principal Components Construction


As just noted, under the Brownian bridge construction a single normal random
variable (say $Z_1$) determines the endpoint of the path; conditional on the
endpoint, a second normal random variable (say $Z_2$) determines the midpoint
of the path, and so on. Thus, under this construction, much of the ultimate
shape of the Brownian path is determined (or explained) by the values of just
the first few $Z_i$. Is there a construction under which even more of the path
is determined by the first few $Z_i$? Is there a construction that maximizes the
variability of the path explained by $Z_1, \ldots, Z_k$ for all $k = 1, \ldots, n$?

This optimality objective is achieved for any normal random vector by the
principal components construction discussed in Section 2.3.3. We now discuss
its application to a discrete Brownian path $W(t_1), \ldots, W(t_n)$. It is useful to
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Fig. 3.3. Brownian bridge construction after 1, 2, 4, and 8 points have been sampled. 
Each step refines the previous path. 


visualize the construction in vector form as 


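$$\begin{pmatrix} W(t_1) \\ W(t_2) \\ \vdots \\ W(t_n) \end{pmatrix}
= \begin{pmatrix} a_{11} \\ a_{21} \\ \vdots \\ a_{n1} \end{pmatrix} Z_1
+ \begin{pmatrix} a_{12} \\ a_{22} \\ \vdots \\ a_{n2} \end{pmatrix} Z_2
+ \cdots
+ \begin{pmatrix} a_{1n} \\ a_{2n} \\ \vdots \\ a_{nn} \end{pmatrix} Z_n. \tag{3.9}$$

Let $a_i = (a_{1i}, \ldots, a_{ni})^\top$ and let $A$ be the $n \times n$ matrix with columns
$a_1, \ldots, a_n$. We know from Section 2.3.3 that this is a valid construction of
the discrete Brownian path if $AA^\top$ is the covariance matrix $C$ of $W =
(W(t_1), \ldots, W(t_n))^\top$, given in (3.6). We also know from the discussion of
principal components in Section 2.3.3 that the approximation error


$$E\left[ \Big\| W - \sum_{i=1}^{k} a_i Z_i \Big\|^2 \right] \qquad (\|x\|^2 = x^\top x)$$


from using just the first $k$ terms in (3.9) is minimized for all $k = 1, \ldots, n$
by using principal components. Specifically, $a_i = \sqrt{\lambda_i}\, v_i$, $i = 1, \ldots, n$, where
$\lambda_1 > \lambda_2 > \cdots > \lambda_n > 0$ are the eigenvalues of $C$ and the $v_i$ are eigenvectors,


$$C v_i = \lambda_i v_i, \quad i = 1, \ldots, n,$$

normalized to have length $\|v_i\| = 1$.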


Consider, for example, a 32-step discrete Brownian path with equal time
increments $t_{i+1} - t_i = 1/32$. The corresponding covariance matrix has entries
$C_{ij} = \min(i, j)/32$, $i, j = 1, \ldots, 32$. The magnitudes of the eigenvalues of this
matrix drop off rapidly: the five largest are 13.380, 1.489, 0.538, 0.276, and
0.168. The variability explained by $Z_1, \ldots, Z_k$ (in the sense of (2.34)) is 81%,
90%, 93%, 95%, and 96%, for $k = 1, \ldots, 5$; it exceeds 99% at $k = 16$. This
indicates that although full construction of the 32-step path requires 32 normal
random variables, most of the variability of the path can be determined using
far fewer $Z_i$.

Figure 3.4 plots the normalized eigenvectors $v_1$, $v_2$, $v_3$, and $v_4$ associated
with the four largest eigenvalues. (Each of these is a vector with 32 entries;
they are plotted against the $j\Delta t$, $j = 1, \ldots, 32$, with $\Delta t = 1/32$.) The $v_i$
appear to be nearly sinusoidal, with frequencies that increase with $i$. Indeed,
Åkesson and Lehoczky [8] show that for an $n$-step path with equal spacing
$t_{i+1} - t_i = \Delta t$,


$$v_i(j) = \frac{2}{\sqrt{2n+1}} \sin\left( \frac{2i-1}{2n+1}\, j\pi \right), \quad j = 1, \ldots, n,$$


and

$$\lambda_i = \frac{\Delta t}{4} \sin^{-2}\left( \frac{2i-1}{2n+1}\, \frac{\pi}{2} \right),$$

for $i = 1, \ldots, n$. To contrast this with the Brownian bridge construction in
Figure 3.3, note that in the principal components construction the $v_i$ are
multiplied by $\sqrt{\lambda_i} Z_i$ and then summed; thus, the discrete Brownian path
may be viewed as a random linear combination of the vectors $v_i$, with random
coefficients $\sqrt{\lambda_i} Z_i$. The coefficient on $v_i$ has variance $\lambda_i$ and we have seen that
the $\lambda_i$ drop off quickly. Thus, the first few $v_i$ (and $\sqrt{\lambda_i} Z_i$) determine most of
the shape of the Brownian path and the later $v_i$ add high-frequency detail to
the path. As in the Brownian bridge construction, these features can be useful
in implementing variance reduction techniques by making it possible to focus
on the most important $Z_i$. We return to this point in Sections 4.3.2 and 5.5.2.
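
The 32-step figures quoted above can be reproduced from the closed-form eigenvalues. The sketch below assumes the equal-spacing formula attributed to Åkesson and Lehoczky; the function names are hypothetical and chosen only for this illustration.

```python
import math


def bm_eigenvalues(n, dt):
    """Eigenvalues of the covariance matrix C[i][j] = min(i, j) * dt,
    largest first, using the closed form for equally spaced times."""
    return [
        dt / (4.0 * math.sin((2 * i - 1) * math.pi / (2 * (2 * n + 1))) ** 2)
        for i in range(1, n + 1)
    ]


def explained_variability(eigs):
    """Cumulative share of total variance carried by the first k factors."""
    total = sum(eigs)
    shares, running = [], 0.0
    for lam in eigs:
        running += lam
        shares.append(running / total)
    return shares


eigs = bm_eigenvalues(32, 1.0 / 32)
shares = explained_variability(eigs)
```

The sum of the eigenvalues equals the trace of the covariance matrix, here $(1 + 2 + \cdots + 32)/32 = 16.5$, which gives a quick consistency check on the formula.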
Although the principal components construction is optimal with respect
to explained variability, it has two drawbacks compared to the random walk
and Brownian bridge constructions. The first is that it requires $O(n^2)$ operations
to construct $W(t_1), \ldots, W(t_n)$ from $Z_1, \ldots, Z_n$, whereas the previous
constructions require $O(n)$ operations. The second (potential) drawback is
that with principal components none of the $W(t_i)$ is fully determined until
all $Z_1, \ldots, Z_n$ have been processed, i.e., until all terms in (3.9) have been
summed. In contrast, using either the random walk or Brownian bridge construction,
exactly $k$ of the $W(t_1), \ldots, W(t_n)$ are fixed by the first $k$ normal
random variables, for all $k = 1, \ldots, n$.
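
To make the $O(n^2)$ cost concrete, here is a small sketch of the principal components construction for equally spaced times, again assuming the closed-form eigenvalues and eigenvectors quoted earlier. The names `pc_factors` and `pc_path` are hypothetical.

```python
import math


def pc_factors(n, dt):
    """Return (lam, vecs): lam[i] is the (i+1)-th eigenvalue and
    vecs[i][j] is entry j+1 of the (i+1)-th unit eigenvector."""
    lam, vecs = [], []
    scale = 2.0 / math.sqrt(2 * n + 1)
    for i in range(1, n + 1):
        angle = (2 * i - 1) * math.pi / (2 * n + 1)
        lam.append(dt / (4.0 * math.sin(angle / 2.0) ** 2))
        vecs.append([scale * math.sin(angle * j) for j in range(1, n + 1)])
    return lam, vecs


def pc_path(n, dt, z):
    """W(t_j) = sum_i sqrt(lam_i) v_i(j) Z_i; every output entry touches
    every Z_i, hence the O(n^2) operation count."""
    lam, vecs = pc_factors(n, dt)
    return [
        sum(math.sqrt(lam[i]) * vecs[i][j] * z[i] for i in range(n))
        for j in range(n)
    ]
```

Summing $\lambda_i v_i(j) v_i(k)$ over $i$ should recover $\min(t_j, t_k)$, which is the condition $AA^\top = C$ in this special case.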
We conclude this discussion of the principal components construction with
a brief digression into simulation of a continuous path $\{W(t), 0 \le t \le 1\}$. In
the discrete case, the eigenvalue-eigenvector condition $Cv = \lambda v$ is (recall (3.6))


$$\sum_{j} \min(t_i, t_j)\, v(j) = \lambda v(i).$$
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Fig. 3.4. First four eigenvectors of the covariance matrix of a 32-step Brownian 
path, ordered according to the magnitude of the corresponding eigenvalue. 


In the continuous limit, the analogous property for an eigenfunction $\psi$ on
$[0, 1]$ is


$$\int_0^1 \min(s, t)\, \psi(s)\, ds = \lambda \psi(t).$$


The solutions to this equation and the corresponding eigenvalues are


$$\psi_i(t) = \sqrt{2} \sin\left( \frac{(2i+1)\pi t}{2} \right), \quad \lambda_i = \left( \frac{2}{(2i+1)\pi} \right)^2, \quad i = 0, 1, 2, \ldots.$$


Note in particular that the $\psi_i$ are periodic with increasing frequencies and
that the $\lambda_i$ decrease with $i$. The Karhunen-Loève expansion of Brownian
motion is


$$W(t) = \sum_{i=0}^{\infty} \sqrt{\lambda_i}\, \psi_i(t)\, Z_i, \quad 0 \le t \le 1, \tag{3.10}$$


with $Z_0, Z_1, \ldots$ independent $N(0,1)$ random variables; see, e.g., Adler [5].
This infinite series is an exact representation of the continuous Brownian
path. It may be viewed as a continuous counterpart to (3.9). By taking just
the first $k$ terms in this series, we arrive at an approximation to the continuous
path $\{W(t), 0 \le t \le 1\}$ that is optimal (among all approximations that
use just $k$ standard normals) in the sense of explained variability. This approximation
does not however yield the exact joint distribution for any subset
$\{W(t_1), \ldots, W(t_n)\}$ except the trivial case $\{W(0)\}$.
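
The last remark can be checked numerically: the variance at time $t$ of the series truncated after $k$ terms is $\sum_{i<k} \lambda_i \psi_i(t)^2$, which falls short of the exact value $t$ for every finite $k$. The following sketch (with the hypothetical name `kl_truncated_variance`) evaluates that sum.

```python
import math


def kl_truncated_variance(t, k):
    """Variance at time t of the first k terms of the series (3.10)."""
    total = 0.0
    for i in range(k):
        lam = (2.0 / ((2 * i + 1) * math.pi)) ** 2
        psi = math.sqrt(2.0) * math.sin((2 * i + 1) * math.pi * t / 2.0)
        total += lam * psi * psi
    return total
```

For example, a single term already accounts for $8/\pi^2 \approx 81\%$ of the variance at $t = 1$, and the shortfall shrinks slowly as $k$ grows.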

The Brownian bridge construction also admits a continuous counterpart
through a series expansion using Schauder functions in place of the $\sqrt{\lambda_i}\psi_i$
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in (3.10). Lévy [233, pp.17-20] used the limit of the Brownian bridge con- 
struction to construct Brownian motion; the formulation as a series expan- 
sion is discussed in Section 1.2 of McKean [260]. Truncating the series after 
$2^m$ terms produces the piecewise linear interpolation of a discrete Brownian
bridge construction of $W(0), W(2^{-m}), \ldots, W(1)$. See Acworth et al. [4] for
further discussion with applications to Monte Carlo. 


3.1.2 Multiple Dimensions 


We call a process $W(t) = (W_1(t), \ldots, W_d(t))^\top$, $0 \le t \le T$, a standard Brownian
motion on $\mathbb{R}^d$ if it has $W(0) = 0$, continuous sample paths, independent
increments, and

$$W(t) - W(s) \sim N(0, (t - s)I),$$

for all $0 \le s < t \le T$, with $I$ the $d \times d$ identity matrix. It follows that each
of the coordinate processes $W_i(t)$, $i = 1, \ldots, d$, is a standard one-dimensional
Brownian motion and that $W_i$ and $W_j$ are independent for $i \ne j$.

Suppose $\mu$ is a vector in $\mathbb{R}^d$ and $\Sigma$ is a $d \times d$ matrix, positive definite
or semidefinite. We call a process $X$ a Brownian motion with drift $\mu$ and
covariance $\Sigma$ (abbreviated $X \sim BM(\mu, \Sigma)$) if $X$ has continuous sample paths
and independent increments with


$$X(t) - X(s) \sim N((t - s)\mu, (t - s)\Sigma).$$


The initial value $X(0)$ is an arbitrary constant assumed to be 0 unless otherwise
specified. If $B$ is a $d \times k$ matrix satisfying $BB^\top = \Sigma$ and if $W$ is a
standard Brownian motion on $\mathbb{R}^k$, then the process defined by


$$X(t) = \mu t + BW(t) \tag{3.11}$$


is a $BM(\mu, \Sigma)$. In particular, the law of $X$ depends on $B$ only through $BB^\top$.
The process in (3.11) solves the SDE


$$dX(t) = \mu\, dt + B\, dW(t).$$


We may extend the definition of a multidimensional Brownian motion to
deterministic, time-varying $\mu(t)$ and $\Sigma(t)$ through the solution to


$$dX(t) = \mu(t)\, dt + B(t)\, dW(t),$$


where $B(t)B^\top(t) = \Sigma(t)$. This process has continuous sample paths, independent
increments, and


$$X(t) - X(s) \sim N\left( \int_s^t \mu(u)\, du,\ \int_s^t \Sigma(u)\, du \right).$$


A calculation similar to the one leading to (3.5) shows that if $X \sim
BM(\mu, \Sigma)$, then
$$\mathrm{Cov}[X_i(s), X_j(t)] = \min(s, t)\, \Sigma_{ij}. \tag{3.12}$$


In particular, given a set of times $0 < t_1 < t_2 < \cdots < t_n$, we can easily find
the covariance matrix of


$$(X_1(t_1), \ldots, X_d(t_1), X_1(t_2), \ldots, X_d(t_2), \ldots, X_1(t_n), \ldots, X_d(t_n)) \tag{3.13}$$


along with its mean and reduce the problem of generating a discrete path of 
X to one of sampling this nd-dimensional normal vector. While there could 
be cases in which this is advantageous, it will usually be more convenient to 
use the fact that this nd-vector is the concatenation of d-vectors representing 
the state of the process at n distinct times. 


Random Walk Construction 


Let $Z_1, Z_2, \ldots$ be independent $N(0, I)$ random vectors in $\mathbb{R}^d$. We can construct
a standard $d$-dimensional Brownian motion at times $0 = t_0 < t_1 < \cdots < t_n$
by setting $W(0) = 0$ and


$$W(t_{i+1}) = W(t_i) + \sqrt{t_{i+1} - t_i}\, Z_{i+1}, \quad i = 0, \ldots, n-1. \tag{3.14}$$


This is equivalent to applying the one-dimensional random walk construction
(3.2) separately to each coordinate of $W$.

To simulate $X \sim BM(\mu, \Sigma)$, we first find a matrix $B$ for which $BB^\top = \Sigma$
(see Section 2.3.3). If $B$ is $d \times k$, let $Z_1, Z_2, \ldots$ be independent standard normal
random vectors in $\mathbb{R}^k$. Set $X(0) = 0$ and


$$X(t_{i+1}) = X(t_i) + \mu(t_{i+1} - t_i) + \sqrt{t_{i+1} - t_i}\, B Z_{i+1}, \quad i = 0, \ldots, n-1. \tag{3.15}$$

Thus, simulation of $BM(\mu, \Sigma)$ is straightforward once $\Sigma$ has been factored.
For the case of time-dependent coefficients, we may set

$$X(t_{i+1}) = X(t_i) + \int_{t_i}^{t_{i+1}} \mu(s)\, ds + B(t_i, t_{i+1}) Z_{i+1}, \quad i = 0, \ldots, n-1,$$

with

$$B(t_i, t_{i+1}) B(t_i, t_{i+1})^\top = \int_{t_i}^{t_{i+1}} \Sigma(u)\, du,$$


thus requiring $n$ factorizations.
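
The constant-coefficient recursion (3.15) might be coded as follows. This is a sketch under two assumptions not made in the text: $\Sigma$ is taken to be positive definite so that a Cholesky factor can serve as $B$ (with $k = d$), and the normal vectors are supplied by the caller. The function names are hypothetical.

```python
import math


def cholesky(sigma):
    """Lower-triangular B with B B^T = sigma (sigma assumed positive definite)."""
    d = len(sigma)
    b = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sigma[i][j] - sum(b[i][k] * b[j][k] for k in range(j))
            b[i][j] = math.sqrt(s) if i == j else s / b[j][j]
    return b


def bm_path(mu, sigma, times, normals):
    """Apply (3.15) with X(0) = 0; normals[i] is the d-vector Z_{i+1}.
    Returns the list [X(t_1), ..., X(t_n)]."""
    d = len(mu)
    b = cholesky(sigma)                 # one factorization, reused every step
    x, t_prev, path = [0.0] * d, 0.0, []
    for t, z in zip(times, normals):
        dt = t - t_prev
        bz = [sum(b[r][c] * z[c] for c in range(d)) for r in range(d)]
        x = [x[r] + mu[r] * dt + math.sqrt(dt) * bz[r] for r in range(d)]
        path.append(x)
        t_prev = t
    return path
```

Any other matrix $B$ with $BB^\top = \Sigma$ would give a path with the same law; the Cholesky factor is simply a convenient choice here.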


Brownian Bridge Construction 


Application of a Brownian bridge construction to a standard $d$-dimensional
Brownian motion is straightforward: we may simply apply independent
one-dimensional constructions to each of the coordinates. To include a drift vector
(i.e., for a $BM(\mu, I)$ process), it suffices to add $\mu_i t_n$ to $W_i(t_n)$ at the first step
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of the construction of the ith coordinate, as explained in Section 3.1.1. The 
rest of the construction is unaffected. 

To construct $X \sim BM(\mu, \Sigma)$, we may use the fact that $X$ can be represented
as $X(t) = \mu t + BW(t)$ with $B$ a $d \times k$ matrix, $k \le d$, satisfying
$BB^\top = \Sigma$, and $W$ a standard $k$-dimensional Brownian motion. We may
then apply a Brownian bridge construction to $W(t_1), \ldots, W(t_n)$ and recover
$X(t_1), \ldots, X(t_n)$ through a linear transformation.


Principal Components Construction 


As with the Brownian bridge construction, one could apply a one-dimensional
principal components construction to each coordinate of a multidimensional
Brownian motion. Through a linear transformation this then extends to the
construction of $BM(\mu, \Sigma)$. However, the optimality of principal components is
lost in this reduction; to recover it, we must work directly with the covariance
matrix of (3.13).

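It follows from (3.12) that the covariance matrix of (3.13) can be represented
as $(C \otimes \Sigma)$, where $\otimes$ denotes the Kronecker product producing


$$(C \otimes \Sigma) = \begin{pmatrix}
C_{11}\Sigma & C_{12}\Sigma & \cdots & C_{1n}\Sigma \\
C_{21}\Sigma & C_{22}\Sigma & \cdots & C_{2n}\Sigma \\
\vdots & \vdots & & \vdots \\
C_{n1}\Sigma & C_{n2}\Sigma & \cdots & C_{nn}\Sigma
\end{pmatrix}.$$

If $C$ has eigenvectors $v_1, \ldots, v_n$ and eigenvalues $\lambda_1 \ge \cdots \ge \lambda_n$, and if $\Sigma$
has eigenvectors $w_1, \ldots, w_d$ and eigenvalues $\eta_1 \ge \cdots \ge \eta_d$, then $(C \otimes \Sigma)$
has eigenvectors $(v_i \otimes w_j)$ and eigenvalues $\lambda_i \eta_j$, $i = 1, \ldots, n$, $j = 1, \ldots, d$. This
special structure of the covariance matrix of (3.12) makes it possible to reduce
the computational effort required to find all eigenvalues and eigenvectors from
the $O((nd)^3)$ typically required for an $(nd \times nd)$ matrix to $O(n^3 + d^3)$.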

If we rank the products of eigenvalues as


$$(\lambda_i \eta_j)_{(1)} \ge (\lambda_i \eta_j)_{(2)} \ge \cdots \ge (\lambda_i \eta_j)_{(nd)},$$


then for any $k = 1, \ldots, n$,


$$\frac{\sum_{r=1}^{k} (\lambda_i \eta_j)_{(r)}}{\sum_{r=1}^{nd} (\lambda_i \eta_j)_{(r)}} \le \frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{n} \lambda_i}.$$


In other words, the variability explained by the first $k$ factors is always smaller
for a $d$-dimensional Brownian motion than it would be for a scalar Brownian
motion over the same time points. This is to be expected since the
$d$-dimensional process has greater total variability.
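
The inequality can be illustrated numerically without forming the $nd \times nd$ matrix, since the eigenvalues of $(C \otimes \Sigma)$ are just the pairwise products. In the sketch below the eigenvalues `eta` of $\Sigma$ are hypothetical numbers chosen for illustration, and the scalar eigenvalues use the equal-spacing closed form quoted in Section 3.1.1.

```python
import math


def scalar_eigs(n, dt):
    """Eigenvalues of C[i][j] = min(i, j) * dt for equally spaced times."""
    return [
        dt / (4.0 * math.sin((2 * i - 1) * math.pi / (2 * (2 * n + 1))) ** 2)
        for i in range(1, n + 1)
    ]


def explained(eigs, k):
    """Share of total variance carried by the k largest eigenvalues."""
    ranked = sorted(eigs, reverse=True)
    return sum(ranked[:k]) / sum(ranked)


lam = scalar_eigs(32, 1.0 / 32)
eta = [1.5, 0.5]                                   # hypothetical eigenvalues of Sigma
products = [l * e for l in lam for e in eta]       # eigenvalues of C (x) Sigma
gaps = [explained(lam, k) - explained(products, k) for k in range(1, 33)]
```

Every entry of `gaps` is nonnegative, in line with the displayed inequality.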
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3.2 Geometric Brownian Motion 


A stochastic process S(t) is a geometric Brownian motion if log S(t) is a 
Brownian motion with initial value log S(0); in other words, a geometric 
Brownian motion is simply an exponentiated Brownian motion. Accordingly, 
all methods for simulating Brownian motion become methods for simulating 
geometric Brownian motion through exponentiation. This section therefore 
focuses more on modeling than on algorithmic issues. 

Geometric Brownian motion is the most fundamental model of the value 
of a financial asset. In his pioneering thesis of 1900, Louis Bachelier developed 
a model of stock prices that in retrospect we describe as ordinary Brownian 
motion, though the mathematics of Brownian motion had not yet been de- 
veloped. The use of geometric Brownian motion as a model in finance is due 
primarily to work of Paul Samuelson in the 1960s. Whereas ordinary Brown- 
ian motion can take negative values — an undesirable feature in a model of 
the price of a stock or any other limited liability asset — geometric Brownian 
motion is always positive because the exponential function takes only positive 
values. More fundamentally, for geometric Brownian motion the percentage 


changes


$$\frac{S(t_2) - S(t_1)}{S(t_1)},\ \frac{S(t_3) - S(t_2)}{S(t_2)},\ \ldots,\ \frac{S(t_n) - S(t_{n-1})}{S(t_{n-1})} \tag{3.16}$$


are independent for $t_1 < t_2 < \cdots < t_n$, rather than the absolute changes
$S(t_{i+1}) - S(t_i)$. These properties explain the centrality of geometric rather
than ordinary Brownian motion in modeling asset prices. 


3.2.1 Basic Properties 
Suppose $W$ is a standard Brownian motion and $X$ satisfies

$$dX(t) = \mu\, dt + \sigma\, dW(t),$$


so that $X \sim BM(\mu, \sigma^2)$. If we set $S(t) = S(0)\exp(X(t)) = f(X(t))$, then an
application of Itô's formula shows that


$$\begin{aligned}
dS(t) &= f'(X(t))\, dX(t) + \tfrac{1}{2}\sigma^2 f''(X(t))\, dt \\
&= S(0)\exp(X(t))[\mu\, dt + \sigma\, dW(t)] + \tfrac{1}{2}\sigma^2 S(0)\exp(X(t))\, dt \\
&= S(t)(\mu + \tfrac{1}{2}\sigma^2)\, dt + S(t)\sigma\, dW(t).
\end{aligned} \tag{3.17}$$


In contrast, a geometric Brownian motion process is often specified through
an SDE of the form

$$\frac{dS(t)}{S(t)} = \mu\, dt + \sigma\, dW(t), \tag{3.18}$$


an expression suggesting a Brownian model of the “instantaneous returns”
$dS(t)/S(t)$. Comparison of (3.17) and (3.18) indicates that the models are
inconsistent and reveals an ambiguity in the role of “$\mu$.” In (3.17), $\mu$ is the
drift of the Brownian motion we exponentiated to define $S(t)$, that is, the drift of
$\log S(t)$. In (3.18), $S(t)$ has drift $\mu S(t)$ and (3.18) implies


$$d\log S(t) = (\mu - \tfrac{1}{2}\sigma^2)\, dt + \sigma\, dW(t), \tag{3.19}$$


as can be verified through Itô's formula or comparison with (3.17).
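
One way to carry out the Itô verification, sketched here under the assumption that $S$ satisfies (3.18), is to apply Itô's formula to $g(s) = \log s$ (the symbol $g$ is introduced only for this illustration), for which $g'(s) = 1/s$ and $g''(s) = -1/s^2$. Since the diffusion coefficient of $S(t)$ is $\sigma^2 S^2(t)$,

$$\begin{aligned}
d\log S(t) &= g'(S(t))\, dS(t) + \tfrac{1}{2} g''(S(t))\, \sigma^2 S^2(t)\, dt \\
&= [\mu\, dt + \sigma\, dW(t)] - \tfrac{1}{2}\sigma^2\, dt \\
&= (\mu - \tfrac{1}{2}\sigma^2)\, dt + \sigma\, dW(t),
\end{aligned}$$

which is (3.19).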

We will use the notation $S \sim GBM(\mu, \sigma^2)$ to indicate that $S$ is a process of
the type in (3.18). We will refer to $\mu$ in (3.18) as the drift parameter though
it is not the drift of either $S(t)$ or $\log S(t)$. We refer to $\sigma$ in (3.18) as the
volatility parameter of $S(t)$; the diffusion coefficient of $S(t)$ is $\sigma^2 S^2(t)$.

From (3.19) we see that if $S \sim GBM(\mu, \sigma^2)$ and if $S$ has initial value $S(0)$,
then


$$S(t) = S(0) \exp\left( [\mu - \tfrac{1}{2}\sigma^2]t + \sigma W(t) \right). \tag{3.20}$$

A bit more generally, if $u < t$ then

$$S(t) = S(u) \exp\left( [\mu - \tfrac{1}{2}\sigma^2](t - u) + \sigma(W(t) - W(u)) \right), \tag{3.21}$$


from which the claimed independence of the returns in (3.16) becomes evident.
Moreover, since the increments of $W$ are independent and normally
distributed, this provides a simple recursive procedure for simulating values
of $S$ at $0 = t_0 < t_1 < \cdots < t_n$:


$$S(t_{i+1}) = S(t_i) \exp\left( [\mu - \tfrac{1}{2}\sigma^2](t_{i+1} - t_i) + \sigma\sqrt{t_{i+1} - t_i}\, Z_{i+1} \right), \quad i = 0, 1, \ldots, n-1, \tag{3.22}$$


with $Z_1, Z_2, \ldots, Z_n$ independent standard normals. In fact, (3.22) is equivalent
to exponentiating both sides of (3.3) with $\mu$ replaced by $\mu - \tfrac{1}{2}\sigma^2$. This method
is exact in the sense that the $(S(t_1), \ldots, S(t_n))$ it produces has the joint distribution
of the process $S \sim GBM(\mu, \sigma^2)$ at $t_1, \ldots, t_n$; the method involves
no discretization error. Time-dependent parameters can be incorporated by
exponentiating both sides of (3.4).
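
A direct transcription of the recursion (3.22) into Python might look like the following sketch. The function name `gbm_path` and the option of passing the normals in explicitly are assumptions made for the example.

```python
import math
import random


def gbm_path(s0, mu, sigma, times, normals=None, rng=random):
    """Simulate S(t_1), ..., S(t_n) for S ~ GBM(mu, sigma^2) via (3.22).

    times must be increasing and positive; normals, if given, supplies
    Z_1, ..., Z_n so that the path is reproducible.
    """
    if normals is None:
        normals = [rng.gauss(0.0, 1.0) for _ in times]
    s, t_prev, path = s0, 0.0, []
    for t, z in zip(times, normals):
        dt = t - t_prev
        s *= math.exp((mu - 0.5 * sigma ** 2) * dt + sigma * math.sqrt(dt) * z)
        path.append(s)
        t_prev = t
    return path
```

Because the recursion is exact, the spacing of the time grid affects only where the path is observed, not the accuracy of the simulated values.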


Lognormal Distribution 


From (3.20) we see that if $S \sim GBM(\mu, \sigma^2)$, then the marginal distribution
of $S(t)$ is that of the exponential of a normal random variable, which is called
a lognormal distribution. We write $Y \sim LN(\mu, \sigma^2)$ if the random variable $Y$
has the distribution of $\exp(\mu + \sigma Z)$, $Z \sim N(0,1)$. This distribution is thus
given by


$$P(Y \le y) = P(Z \le [\log(y) - \mu]/\sigma) = \Phi\left( \frac{\log(y) - \mu}{\sigma} \right)$$


and its density by
$$\frac{1}{y\sigma}\, \phi\left( \frac{\log(y) - \mu}{\sigma} \right). \tag{3.23}$$

Moments of a lognormal random variable can be calculated using the basic
identity

$$E[e^{aZ}] = e^{\frac{1}{2}a^2}$$

for the moment generating function of a standard normal. From this it follows
that $Y \sim LN(\mu, \sigma^2)$ has


$$E[Y] = e^{\mu + \frac{1}{2}\sigma^2}, \quad \mathrm{Var}[Y] = e^{2\mu + \sigma^2}\left( e^{\sigma^2} - 1 \right);$$


in particular, the notation $Y \sim LN(\mu, \sigma^2)$ does not imply that $\mu$ and $\sigma^2$ are
the mean and variance of $Y$. From


$$P(Y \le e^{\mu}) = P(Z \le 0) = \tfrac{1}{2}$$


we see that $e^{\mu}$ is the median of $Y$. The mean of $Y$ is thus larger than the
median, reflecting the positive skew of the lognormal distribution.

Applying these observations to (3.20), we find that if $S \sim GBM(\mu, \sigma^2)$
then $(S(t)/S(0)) \sim LN([\mu - \tfrac{1}{2}\sigma^2]t, \sigma^2 t)$ and


$$E[S(t)] = e^{\mu t} S(0), \quad \mathrm{Var}[S(t)] = e^{2\mu t} S^2(0)\left( e^{\sigma^2 t} - 1 \right).$$
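
The step from the lognormal moments to the moments of $S(t)$ can be checked with a few lines of Python. The function names below are hypothetical; the formulas are the ones displayed above, with $m = [\mu - \tfrac{1}{2}\sigma^2]t$ and $v = \sigma^2 t$ playing the roles of the two lognormal parameters.

```python
import math


def lognormal_mean(m, v):
    """Mean of LN(m, v), i.e. of exp(m + sqrt(v) Z)."""
    return math.exp(m + 0.5 * v)


def lognormal_var(m, v):
    """Variance of LN(m, v)."""
    return math.exp(2.0 * m + v) * (math.exp(v) - 1.0)


def gbm_moments(s0, mu, sigma, t):
    """Mean and variance of S(t) for S ~ GBM(mu, sigma^2), S(0) = s0."""
    m, v = (mu - 0.5 * sigma ** 2) * t, sigma ** 2 * t
    return s0 * lognormal_mean(m, v), s0 ** 2 * lognormal_var(m, v)
```

The $-\tfrac{1}{2}\sigma^2 t$ in the first parameter cancels against the $+\tfrac{1}{2}v$ in the lognormal mean, which is why $E[S(t)]$ grows at rate $\mu$.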


In fact, we have 
$$E[S(t) \mid S(\tau), 0 \le \tau \le u] = E[S(t) \mid S(u)] = e^{\mu(t-u)} S(u), \quad u < t, \tag{3.24}$$


and an analogous expression for the conditional variance. The first equality 
in (3.24) is the Markov property (which follows from the fact that S is a one- 
to-one transformation of a Brownian motion, itself a Markov process) and the 
second follows from (3.21). 

Equation (3.24) indicates that $\mu$ acts as an average growth rate for $S$, a
sort of average continuously compounded rate of return. Along a single sample
path of $S$ the picture is different. For a standard Brownian motion $W$, we have
$t^{-1}W(t) \to 0$ with probability 1. For $S \sim GBM(\mu, \sigma^2)$, we therefore find that


$$\frac{1}{t} \log S(t) \to \mu - \tfrac{1}{2}\sigma^2,$$


with probability 1, so $\mu - \tfrac{1}{2}\sigma^2$ serves as the growth rate along each path. If this
expression is positive, $S(t) \to \infty$ as $t \to \infty$; if it is negative, then $S(t) \to 0$.
In a model with $\mu > 0 > \mu - \tfrac{1}{2}\sigma^2$, we find from (3.24) that $E[S(t)]$ grows
exponentially although $S(t)$ converges to 0. This seemingly pathological behavior
is explained by the increasing skew in the distribution of $S(t)$: although
$S(t) \to 0$, rare but very large values of $S(t)$ are sufficiently likely to produce
an increasing mean.


3.2.2 Path-Dependent Options 


Our interest in simulating paths of geometric Brownian motion lies primarily 
in pricing options, particularly those whose payoffs depend on the path of an 
underlying asset S and not simply its value S(T) at a fixed exercise date T. 
Through the principles of option pricing developed in Chapter 1, the price of 
an option may be represented as an expected discounted payoff. This price 
is estimated through simulation by generating paths of the underlying asset, 
evaluating the discounted payoff on each path, and averaging over paths. 


Risk-Neutral Dynamics 


The one subtlety in this framework is the probability measure with respect to 
which the expectation is taken and the nearly equivalent question of how the 
payoff should be discounted. This bears on how the paths of the underlying 
asset ought to be generated and more specifically in the case of geometric 
Brownian motion, how the drift parameter u should be chosen. 

We start by assuming the existence of a constant continuously compounded 
interest rate r for riskless borrowing and lending. A dollar invested at this rate 
at time 0 grows to a value of 


$$\beta(t) = e^{rt}$$


at time $t$. Similarly, a contract paying one dollar at a future time $t$ (a zero-coupon
bond) has a value at time 0 of $e^{-rt}$. In pricing under the risk-neutral
measure, we discount a payoff to be received at time $t$ back to time 0 by
dividing by $\beta(t)$; i.e., $\beta$ is the numeraire asset.
Suppose the asset $S$ pays no dividends; then, under the risk-neutral measure,
the discounted price process $S(t)/\beta(t)$ is a martingale:

$$\frac{S(u)}{\beta(u)} = E\left[ \frac{S(t)}{\beta(t)} \,\Big|\, \{S(\tau), 0 \le \tau \le u\} \right]. \tag{3.25}$$


Comparison with (3.24) shows that if $S$ is a geometric Brownian motion under
the risk-neutral measure, then it must have $\mu = r$; i.e.,

$$\frac{dS(t)}{S(t)} = r\, dt + \sigma\, dW(t). \tag{3.26}$$
As discussed in Section 1.2.2, this equation helps explain the name “risk- 
neutral.” In a world of risk-neutral investors, all assets would have the same 
average rate of return — investors would not demand a higher rate of return 
for holding risky assets. In a risk-neutral world, the drift parameter for S(t) 
would therefore equal the risk-free rate r. 
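
Putting the pieces together, a simulation estimate of an expected discounted payoff under (3.26) might be organized as in the sketch below. Everything here is illustrative: the function name, the fixed seed, and the idea of passing the payoff as a Python function of the simulated path are assumptions of the example, and the asset is assumed to pay no dividends.

```python
import math
import random


def discounted_expectation(payoff, s0, r, sigma, times, n_paths=20000, seed=0):
    """Estimate E[e^{-rT} payoff(S(t_1), ..., S(t_n))] with T = times[-1],
    simulating S as GBM(r, sigma^2), i.e. with risk-neutral drift r."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_paths):
        s, t_prev, path = s0, 0.0, []
        for t in times:
            dt = t - t_prev
            z = rng.gauss(0.0, 1.0)
            s *= math.exp((r - 0.5 * sigma ** 2) * dt + sigma * math.sqrt(dt) * z)
            path.append(s)
            t_prev = t
        total += payoff(path)
    return math.exp(-r * times[-1]) * total / n_paths
```

Taking the payoff to be the terminal price itself gives a simple check of the martingale property (3.25): the estimate should be close to $S(0)$, up to sampling error.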
In the case of an asset that pays dividends, we know from Section 1.2.2 that
the martingale property (3.25) continues to hold but with $S$ replaced by the
sum of $S$, any dividends paid by $S$, and any interest earned from investing the
dividends at the risk-free rate $r$. Thus, let $D(t)$ be the value of any dividends
paid over $[0, t]$ and interest earned on those dividends. Suppose the asset pays
a continuous dividend yield of $\delta$, meaning that it pays dividends at rate $\delta S(t)$
at time $t$. Then $D$ grows at rate

$$\frac{dD(t)}{dt} = \delta S(t) + r D(t),$$

the first term on the right reflecting the influx of new dividends and the
second term reflecting interest earned on dividends already accumulated. If
$S \sim GBM(\mu, \sigma^2)$, then the drift in $(S(t) + D(t))$ is


$$\mu S(t) + \delta S(t) + r D(t).$$


The martingale property (3.25), now applied to the combined process $(S(t) +
D(t))$, requires that this drift equal $r(S(t) + D(t))$. We must therefore have
$\mu + \delta = r$; i.e., $\mu = r - \delta$. The net effect of a dividend yield is to reduce the
growth rate by $\delta$.
We discuss some specific settings in which this formulation is commonly 


used: 


o Equity Indices. In pricing index options, the level of the index is often 
modeled as geometric Brownian motion. An index is not an asset and it 
does not pay dividends, but the individual stocks that make up an index 
may pay dividends and this affects the level of the index. Because an index 
may contain many stocks paying a wide range of dividends on different 
dates, the combined effect is often approximated by a continuous dividend 
yield $\delta$.

o Exchange Rates. In pricing currency options, the relevant underlying vari- 
able is an exchange rate. We may think of an exchange rate S (quoted as 
the number of units of domestic currency per unit of foreign currency) as 
the price of the foreign currency. A unit of foreign currency earns interest at
some risk-free rate $r_f$, and this interest may be viewed as a dividend stream.
Thus, in modeling an exchange rate using geometric Brownian motion, we
set $\mu = r - r_f$.

o Commodities. A physical commodity like gold or oil may in some cases 
behave like an asset that pays negative dividends because of the cost of 
storing the commodity. This is easily accommodated in the setting above 
by taking $\delta < 0$. There may, however, be some value in holding a physical
commodity; for example, a party storing oil implicitly holds an option to sell
or consume the oil in case of a shortage. This type of benefit is sometimes
approximated through a hypothetical convenience yield that accrues from
physical storage. The net dividend yield in this case is the difference between
the convenience yield and the cost rate for storage. 

o Futures Contracts. A futures contract commits the holder to buying an 
underlying asset or commodity at a fixed price at a fixed date in the future. 
The futures price is the price specified in a futures contract at which both
the buyer and the seller would willingly enter into the contract without
either party paying the other. A futures price is thus not the price of an
asset but rather a price agreed upon for a transaction in the future.

Let $S(t)$ denote the price of the underlying asset (the spot price) and let
$F(t, T)$ denote the futures price at time $t$ for a contract to be settled at a
fixed time $T$ in the future. Entering into a futures contract at time $t$ to buy
the underlying asset at time $T > t$ is equivalent to committing to exchange
a known amount $F(t, T)$ for an uncertain amount $S(T)$. For this contract
to have zero value at the inception time $t$ entails


$$0 = e^{-r(T-t)} E[(S(T) - F(t, T)) \mid \mathcal{F}_t], \tag{3.27}$$


where $\mathcal{F}_t$ is the history of market prices up to time $t$. At $t = T$ the spot
and futures prices must agree, so $S(T) = F(T, T)$ and we may rewrite this
condition as

$$F(t, T) = E[F(T, T) \mid \mathcal{F}_t].$$

Thus, the futures price is a martingale (in its first argument) under the
risk-neutral measure. It follows that if we choose to model a futures price
(for fixed maturity $T$) using geometric Brownian motion, we should set its
drift parameter to zero:

$$\frac{dF(t, T)}{F(t, T)} = \sigma\, dW(t).$$


Comparison of (3.27) and (3.25) reveals that

$$F(t, T) = e^{(r - \delta)(T - t)} S(t),$$


with $\delta$ the net dividend yield for $S$. If either process is a geometric Brownian
motion under the risk-neutral measure then the other is as well and they
have the same volatility $\sigma$.

This discussion blurs the distinction between futures and forward contracts. 
The argument leading to (3.27) applies more specifically to a forward price
because a forward contract involves no intermediate cashflows. The holder 
of a futures contract typically makes or receives payments each day through 
a margin account; the discussion above ignores these cashflows. In a world 
with deterministic interest rates, futures and forward prices must be equal 
to preclude arbitrage so the conclusion in (3.27) is valid for both. With 
stochastic interest rates, it turns out that futures prices continue to be 
martingales under the risk-neutral measure but forward prices do not. The 
theoretical relation between futures and forward prices is investigated in 
Cox, Ingersoll, and Ross [90]; it is also discussed in many texts on derivative 


securities (e.g., Hull [189]). 
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Path-Dependent Payoffs 


We turn now to some examples of path-dependent payoffs frequently encountered
in option pricing. We focus primarily on cases in which the payoff depends
on the values $S(t_1), \ldots, S(t_n)$ at a fixed set of dates $t_1, \ldots, t_n$; for these
it is usually possible to produce an unbiased simulation estimate of the option
price. An option payoff could in principle depend on the complete path
$\{S(t), 0 \le t \le T\}$ over an interval $[0, T]$; pricing such an option by simulation
will often entail some discretization bias. In the examples that follow,
we distinguish between discrete and continuous monitoring of the underlying
asset.


o Asian option: discrete monitoring. An Asian option is an option on a time
average of the underlying asset. Asian calls and puts have payoffs $(\bar{S} - K)^+$
and $(K - \bar{S})^+$ respectively, where the strike price $K$ is a constant and


$$\bar{S} = \frac{1}{n} \sum_{i=1}^{n} S(t_i) \tag{3.28}$$


is the average price of the underlying asset over the discrete set of monitoring
dates $t_1, \ldots, t_n$. Other examples have payoffs $(S(T) - \bar{S})^+$ and
$(\bar{S} - S(T))^+$. There are no exact formulas for the prices of these options,
largely because the distribution of $\bar{S}$ is intractable.

o Asian option: continuous monitoring. The continuous counterparts of the
discrete Asian options replace the discrete average above with the continuous
average

$$\bar{S} = \frac{1}{t - u} \int_u^t S(\tau)\, d\tau$$


over an interval $[u, t]$. Though more difficult to simulate, some instances
of continuous-average Asian options allow pricing through the transform
analysis of Geman and Yor [135] and the eigenfunction expansion of Linetsky
[237].

o Geometric average option. Replacing the arithmetic average $\bar{S}$ in (3.28)
with

\[ \left( \prod_{i=1}^{n} S(t_i) \right)^{1/n} \]

produces an option on the geometric average of the underlying asset price.
Such options are seldom if ever found in practice, but they are useful as
test cases for computational procedures and as a basis for approximating
ordinary Asian options. They are mathematically convenient to work with
because the geometric average of (jointly) lognormal random variables is
itself lognormal. From (3.20) we find (with $\mu$ replaced by $r$) that

\[ \left( \prod_{i=1}^{n} S(t_i) \right)^{1/n}
   = S(0) \exp\left( \left[ r - \tfrac{1}{2}\sigma^2 \right] \frac{1}{n} \sum_{i=1}^{n} t_i
     + \frac{\sigma}{n} \sum_{i=1}^{n} W(t_i) \right). \]

From the Linear Transformation Property (2.23) and the covariance matrix
(3.6), we find that

\[ \sum_{i=1}^{n} W(t_i) \sim N\left( 0, \sum_{i=1}^{n} (2i - 1)\, t_{n+1-i} \right). \]

It follows that the geometric average of $S(t_1), \ldots, S(t_n)$ has the same
distribution as the value at time $T$ of a process GBM$(r - \delta, \bar{\sigma}^2)$ with

\[ T = \frac{1}{n} \sum_{i=1}^{n} t_i, \qquad
   \bar{\sigma}^2 = \frac{\sigma^2}{n^2 T} \sum_{i=1}^{n} (2i - 1)\, t_{n+1-i}, \qquad
   \delta = \tfrac{1}{2}\sigma^2 - \tfrac{1}{2}\bar{\sigma}^2. \]

An option on the geometric average may thus be valued using the Black-Scholes
formula (1.44) for an asset paying a continuous dividend yield. The expression

\[ \exp\left( \frac{1}{t - u} \int_u^t \log S(\tau)\, d\tau \right) \]

is a continuously monitored version of the geometric average and is also
lognormally distributed. Options on a continuous geometric average can
similarly be priced in closed form.
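The closed form for the discrete geometric average makes a convenient test case
for a simulation of Asian payoffs. The following is a minimal sketch, not a
definitive implementation: the function names are hypothetical, the payoff is
assumed to be paid at the last monitoring date $t_n$ (so the price is discounted
to $t_n$ rather than to the average date $T$), and the closed form is written in
terms of the forward value $S(0)e^{(r-\delta)T}$ of the geometric average.

```python
import math
import numpy as np


def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def geometric_asian_call(s0, strike, r, sigma, times):
    """Closed-form price of a call on the discrete geometric average.

    Assumption: the payoff is paid at the last monitoring date times[-1].
    """
    t = np.asarray(times, dtype=float)
    n = len(t)
    t_bar = float(t.mean())                            # T = (1/n) sum_i t_i
    weights = 2.0 * np.arange(1, n + 1) - 1.0          # 2i - 1, i = 1..n
    var_sum = float(np.sum(weights * t[::-1]))         # sum_i (2i-1) t_{n+1-i}
    sig_bar = sigma * math.sqrt(var_sum / (n**2 * t_bar))
    delta = 0.5 * sigma**2 - 0.5 * sig_bar**2
    # The geometric average is distributed as GBM(r - delta, sig_bar^2) at t_bar.
    forward = s0 * math.exp((r - delta) * t_bar)
    vol = sig_bar * math.sqrt(t_bar)
    d1 = (math.log(forward / strike) + 0.5 * vol**2) / vol
    d2 = d1 - vol
    undiscounted = forward * norm_cdf(d1) - strike * norm_cdf(d2)
    return math.exp(-r * float(t[-1])) * undiscounted


def asian_call_estimates(s0, strike, r, sigma, times, n_paths, seed=0):
    """Monte Carlo estimates: (arithmetic, geometric, std. error of geometric)."""
    rng = np.random.default_rng(seed)
    t = np.asarray(times, dtype=float)
    dt = np.diff(t, prepend=0.0)
    z = rng.standard_normal((n_paths, len(t)))
    log_paths = math.log(s0) + np.cumsum(
        (r - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * z, axis=1)
    paths = np.exp(log_paths)
    disc = math.exp(-r * float(t[-1]))
    arith = disc * np.maximum(paths.mean(axis=1) - strike, 0.0)
    geom = disc * np.maximum(np.exp(log_paths.mean(axis=1)) - strike, 0.0)
    return arith.mean(), geom.mean(), geom.std(ddof=1) / math.sqrt(n_paths)
```

Because the arithmetic average dominates the geometric average on every path,
the arithmetic estimate computed from the same paths is never smaller than the
geometric one.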

o Barrier options. A typical example of a barrier option is one that gets
"knocked out" if the underlying asset crosses a prespecified level. For
instance, a down-and-out call with barrier $b$, strike $K$, and expiration $T$ has
payoff

\[ \mathbf{1}\{\tau(b) > T\}\,(S(T) - K)^+, \]

where

\[ \tau(b) = \inf\{t_i : S(t_i) < b\} \]

is the first time in $\{t_1, \ldots, t_n\}$ the price of the underlying asset drops below
$b$ (understood to be $\infty$ if $S(t_i) \ge b$ for all $i$) and $\mathbf{1}\{\cdot\}$ denotes the indicator of
the event in braces. A down-and-in call has payoff $\mathbf{1}\{\tau(b) \le T\}(S(T) - K)^+$:
it gets "knocked in" only when the underlying asset crosses the barrier.
Up-and-out and up-and-in calls and puts are defined analogously. Some knock-out
options pay a rebate if the underlying asset crosses the barrier, with
the rebate possibly paid either at the time of the barrier crossing or at the
expiration of the option.

These examples of discretely monitored barrier options are easily priced
by simulation through sampling of $S(t_1), \ldots, S(t_n), S(T)$. A continuously
monitored barrier option is knocked in or out the instant the underlying
asset crosses the barrier; in other words, it replaces $\tau(b)$ as defined above
with

\[ \tau(b) = \inf\{t \ge 0 : S(t) \le b\}. \]

Both discretely monitored and continuously monitored barrier options are
found in practice. Many continuously monitored barrier options can be
priced in closed form; Merton [261] provides what is probably the first such
formula and many other cases can be found in, e.g., Briys et al. [62].
Discretely monitored barrier options generally do not admit pricing formulas
and hence require computational procedures.

o Lookback options. Like barrier options, lookback options depend on extremal
values of the underlying asset price. Lookback puts and calls expiring at $t_n$
have payoffs

\[ \left( \max_{i=1,\ldots,n} S(t_i) - S(t_n) \right) \quad \text{and} \quad
   \left( S(t_n) - \min_{i=1,\ldots,n} S(t_i) \right) \]

respectively. A lookback call, for example, may be viewed as the profit from
buying at the lowest price over $t_1, \ldots, t_n$ and selling at the final price $S(t_n)$.
Continuously monitored versions of these options are defined by taking the
maximum or minimum over an interval rather than a finite set of points.
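The discretely monitored barrier and lookback payoffs above reduce to simple
operations on an array of simulated prices. The sketch below assumes (as an
illustration only) that each row of `paths` holds $S(t_1), \ldots, S(t_n)$ for one
path with $T = t_n$; the function name and the dictionary keys are hypothetical.

```python
import numpy as np


def barrier_and_lookback_payoffs(paths, strike, barrier):
    """Undiscounted payoffs from paths of shape (n_paths, n), with T = t_n."""
    paths = np.asarray(paths, dtype=float)
    terminal = paths[:, -1]
    call = np.maximum(terminal - strike, 0.0)
    # tau(b) <= T exactly when some monitored price falls below the barrier.
    knocked = (paths < barrier).any(axis=1)
    return {
        "down_and_out_call": np.where(knocked, 0.0, call),
        "down_and_in_call": np.where(knocked, call, 0.0),
        "lookback_put": paths.max(axis=1) - terminal,
        "lookback_call": terminal - paths.min(axis=1),
    }
```

On every path the down-and-out and down-and-in payoffs sum to the payoff of an
ordinary call, which gives a simple consistency check for an implementation.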


Incorporating a Term Structure 


Thus far, we have assumed that the risk-free interest rate $r$ is constant. This
implies that the time-$t$ price of a zero-coupon bond maturing (and paying 1)
at time $T > t$ is

\[ B(t, T) = e^{-r(T - t)}. \tag{3.29} \]

Suppose however that at time 0 we observe a collection of bond prices $B(0, T)$,
indexed by maturity $T$, incompatible with (3.29). To price an option on an
underlying asset price $S$ consistent with the observed term structure of bond
prices, we can introduce a deterministic but time-varying risk-free rate $r(u)$
by setting

\[ r(u) = -\left. \frac{\partial}{\partial T} \log B(0, T) \right|_{T = u}. \]

Clearly, then,

\[ B(0, T) = \exp\left( -\int_0^T r(u)\, du \right). \]

With a deterministic, time-varying risk-free rate $r(u)$, the dynamics of an
asset price $S(t)$ under the risk-neutral measure (assuming no dividends) are
described by the SDE

\[ \frac{dS(t)}{S(t)} = r(t)\, dt + \sigma\, dW(t) \]

with solution

\[ S(t) = S(0) \exp\left( \int_0^t r(u)\, du - \tfrac{1}{2}\sigma^2 t + \sigma W(t) \right). \]


This process can be simulated over $0 = t_0 < t_1 < \cdots < t_n$ by setting

\[ S(t_{i+1}) = S(t_i) \exp\left( \int_{t_i}^{t_{i+1}} r(u)\, du
   - \tfrac{1}{2}\sigma^2 (t_{i+1} - t_i) + \sigma \sqrt{t_{i+1} - t_i}\, Z_{i+1} \right), \]

with $Z_1, \ldots, Z_n$ independent $N(0, 1)$ random variables.

If in fact we are interested only in values of $S(t)$ at $t_1, \ldots, t_n$, the simulation
can be simplified, making it unnecessary to introduce a short rate $r(u)$ at all.
If we observe bond prices $B(0, t_1), \ldots, B(0, t_n)$ (either directly or through
interpolation from other observed prices), then since

\[ \frac{B(0, t_i)}{B(0, t_{i+1})} = \exp\left( \int_{t_i}^{t_{i+1}} r(u)\, du \right), \]

we may simulate $S(t)$ using

\[ S(t_{i+1}) = S(t_i)\, \frac{B(0, t_i)}{B(0, t_{i+1})}
   \exp\left( -\tfrac{1}{2}\sigma^2 (t_{i+1} - t_i) + \sigma \sqrt{t_{i+1} - t_i}\, Z_{i+1} \right). \tag{3.30} \]


Simulating Off a Forward Curve 


For some types of underlying assets, particularly commodities, we may observe
not just a spot price $S(0)$ but also a collection of forward prices $F(0, T)$. Here,
$F(0, T)$ denotes the price specified in a contract at time 0 to be paid at time $T$
for the underlying asset. Under the risk-neutral measure, $F(0, T) = E[S(T)]$;
in particular, the forward prices reflect the risk-free interest rate and any
dividend yield (positive or negative) on the underlying asset. In pricing options,
we clearly want to simulate price paths of the underlying asset consistent with
the forward prices observed in the market.

The equality $F(0, T) = E[S(T)]$ implies

\[ S(T) = F(0, T) \exp\left( -\tfrac{1}{2}\sigma^2 T + \sigma W(T) \right). \]

Given forward prices $F(0, t_1), \ldots, F(0, t_n)$, we can simulate using

\[ S(t_{i+1}) = S(t_i)\, \frac{F(0, t_{i+1})}{F(0, t_i)}
   \exp\left( -\tfrac{1}{2}\sigma^2 (t_{i+1} - t_i) + \sigma \sqrt{t_{i+1} - t_i}\, Z_{i+1} \right). \]

This generalizes (3.30) because in the absence of dividends we have
$F(0, T) = S(0)/B(0, T)$. Alternatively, we may define $M(0) = 1$,

\[ M(t_{i+1}) = M(t_i) \exp\left( -\tfrac{1}{2}\sigma^2 (t_{i+1} - t_i)
   + \sigma \sqrt{t_{i+1} - t_i}\, Z_{i+1} \right), \quad i = 0, \ldots, n - 1, \]

and set $S(t_i) = F(0, t_i) M(t_i)$, $i = 1, \ldots, n$.
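The martingale construction $S(t_i) = F(0, t_i) M(t_i)$ can be sketched in a few
lines. This is only an illustration under stated assumptions: the function name
and arguments are hypothetical, the volatility is constant, and `forwards[i]` is
taken to be the observed forward price for `times[i]`. Passing
$F(0, t_i) = S(0)/B(0, t_i)$ recovers the bond-price recursion for an asset
without dividends.

```python
import numpy as np


def simulate_off_forward_curve(forwards, sigma, times, n_paths, rng):
    """Return an (n_paths, n) array of S(t_1), ..., S(t_n) with E[S(t_i)] = F(0, t_i)."""
    t = np.asarray(times, dtype=float)
    dt = np.diff(t, prepend=0.0)
    z = rng.standard_normal((n_paths, len(t)))
    # M is the driftless exponential martingale with M(0) = 1.
    log_m = np.cumsum(-0.5 * sigma**2 * dt + sigma * np.sqrt(dt) * z, axis=1)
    return np.asarray(forwards, dtype=float) * np.exp(log_m)
```

Writing the path as forward price times martingale keeps the market inputs
separate from the random part, so the same draws of $M$ can be reused when the
forward curve is changed.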


Deterministic Volatility Functions 


Although geometric Brownian motion remains an important benchmark, it has
been widely observed across many markets that option prices are incompatible
with a GBM model for the underlying asset. This has fueled research into
alternative specifications of the dynamics of asset prices.

Consider a market in which several options with various strikes and maturities
are traded simultaneously on the same underlying asset. Suppose the
market is sufficiently liquid that we may effectively observe prices of the
options without error. If the assumptions underlying the Black-Scholes formula
held exactly, all of these option prices would result from using the same
volatility parameter $\sigma$ in the formula. In practice, one usually finds that this
implied volatility actually varies with strike and maturity. It is therefore
natural to seek a minimal modification of the Black-Scholes model capable of
reproducing market prices.

Consider the extreme case in which we observe the prices $C(K, T)$ of call
options on a single underlying asset for a continuum of strikes $K$ and
maturities $T$. Dupire [107] shows that, subject only to smoothness conditions on
$C$ as a function of $K$ and $T$, it is possible to find a function $\sigma(S, t)$ such that
the model

\[ \frac{dS(t)}{S(t)} = r\, dt + \sigma(S(t), t)\, dW(t) \]

reproduces the given option prices, in the sense that

\[ e^{-rT} E[(S(T) - K)^+] = C(K, T) \]

for all $K$ and $T$. This is sometimes called a deterministic volatility function to
emphasize that it extends geometric Brownian motion by allowing $\sigma$ to be a
deterministic function of the current level of the underlying asset. This feature
is important because it ensures that options can still be hedged through a
position in the underlying asset, which would not be the case in a stochastic
volatility model.

In practice, we observe only a finite set of option prices and this leaves a
great deal of flexibility in specifying $\sigma(S, t)$ while reproducing market prices.
We may, for example, impose smoothness constraints on the choice of volatility
function. This function will typically be the result of a numerical optimization
procedure and may never be given explicitly.

Once $\sigma(S, t)$ has been chosen to match a set of actively traded options,
simulation may still be necessary to compute the prices of less liquid
path-dependent options. In general, there is no exact simulation procedure for these
models and it is necessary to use an Euler scheme of the form

\[ S(t_{i+1}) = S(t_i) \left( 1 + r (t_{i+1} - t_i)
   + \sigma(S(t_i), t_i) \sqrt{t_{i+1} - t_i}\, Z_{i+1} \right), \]

with $Z_1, Z_2, \ldots$ independent standard normals, or

\[ S(t_{i+1}) = S(t_i) \exp\left( \left[ r - \tfrac{1}{2}\sigma^2(S(t_i), t_i) \right] (t_{i+1} - t_i)
   + \sigma(S(t_i), t_i) \sqrt{t_{i+1} - t_i}\, Z_{i+1} \right), \]

which is equivalent to an Euler scheme for $\log S(t)$.
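The second (log-Euler) scheme is sketched below. The volatility function
`example_vol` is purely hypothetical and stands in for whatever function a
calibration would produce; `simulate_local_vol` and its arguments are likewise
illustrative names. With a constant volatility function the scheme reduces to
exact simulation of geometric Brownian motion, which gives a simple check.

```python
import numpy as np


def example_vol(s, t):
    """Hypothetical volatility function sigma(S, t), decreasing in S."""
    return 0.2 * np.sqrt(100.0 / s)


def simulate_local_vol(s0, r, vol_fn, times, n_paths, rng):
    """Log-Euler scheme; returns an (n_paths, n) array of S(t_1), ..., S(t_n)."""
    t = np.concatenate(([0.0], np.asarray(times, dtype=float)))
    s = np.full(n_paths, float(s0))
    out = np.empty((n_paths, len(times)))
    for i in range(len(times)):
        h = t[i + 1] - t[i]
        v = vol_fn(s, t[i])               # volatility frozen at the current state
        z = rng.standard_normal(n_paths)
        s = s * np.exp((r - 0.5 * v**2) * h + v * np.sqrt(h) * z)
        out[:, i] = s
    return out
```

The exponential form keeps simulated prices positive, which the first scheme
does not guarantee for large steps.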
3.2.3 Multiple Dimensions


A multidimensional geometric Brownian motion can be specified through a
system of SDEs of the form

\[ \frac{dS_i(t)}{S_i(t)} = \mu_i\, dt + \sigma_i\, dX_i(t), \quad i = 1, \ldots, d, \tag{3.31} \]

where each $X_i$ is a standard one-dimensional Brownian motion and $X_i(t)$
and $X_j(t)$ have correlation $\rho_{ij}$. If we define a $d \times d$ matrix $\Sigma$ by setting
$\Sigma_{ij} = \sigma_i \sigma_j \rho_{ij}$, then $(\sigma_1 X_1, \ldots, \sigma_d X_d) \sim$ BM$(0, \Sigma)$. In this case we abbreviate
the process $S = (S_1, \ldots, S_d)$ as GBM$(\mu, \Sigma)$ with $\mu = (\mu_1, \ldots, \mu_d)$. In a
convenient abuse of terminology, we refer to $\mu$ as the drift vector of $S$, to $\Sigma$ as its
covariance matrix and to the matrix with entries $\rho_{ij}$ as its correlation matrix;
the actual drift vector is $(\mu_1 S_1(t), \ldots, \mu_d S_d(t))$ and the covariances are given
by

\[ \mathrm{Cov}[S_i(t), S_j(t)] = S_i(0) S_j(0)\, e^{(\mu_i + \mu_j) t}
   \left( e^{\rho_{ij} \sigma_i \sigma_j t} - 1 \right). \]

This follows from the representation

\[ S_i(t) = S_i(0)\, e^{(\mu_i - \frac{1}{2}\sigma_i^2) t + \sigma_i X_i(t)}, \quad i = 1, \ldots, d. \]

Recall that a Brownian motion BM$(0, \Sigma)$ can be represented as $AW(t)$
with $W$ a standard Brownian motion BM$(0, I)$ and $A$ any matrix for which
$AA^\top = \Sigma$. We may apply this to $(\sigma_1 X_1, \ldots, \sigma_d X_d)$ and rewrite (3.31) as

\[ \frac{dS_i(t)}{S_i(t)} = \mu_i\, dt + a_i\, dW(t), \quad i = 1, \ldots, d, \tag{3.32} \]

with $a_i$ the $i$th row of $A$. A bit more explicitly, this is

\[ \frac{dS_i(t)}{S_i(t)} = \mu_i\, dt + \sum_{j=1}^{d} A_{ij}\, dW_j(t), \quad i = 1, \ldots, d. \]

This representation leads to a simple algorithm for simulating GBM$(\mu, \Sigma)$
at times $0 = t_0 < t_1 < \cdots < t_n$:


\[ S_i(t_{k+1}) = S_i(t_k) \exp\left( (\mu_i - \tfrac{1}{2}\sigma_i^2)(t_{k+1} - t_k)
   + \sqrt{t_{k+1} - t_k} \sum_{j=1}^{d} A_{ij} Z_{k+1,j} \right), \quad i = 1, \ldots, d, \tag{3.33} \]

$k = 0, \ldots, n - 1$, where $Z_k = (Z_{k1}, \ldots, Z_{kd}) \sim N(0, I)$ and $Z_1, Z_2, \ldots, Z_n$
are independent. As usual, choosing $A$ to be the Cholesky factor of $\Sigma$ can
reduce the number of multiplications and additions required at each step.
Notice that (3.33) is essentially equivalent to exponentiating both sides of the
recursion (3.15); indeed, all methods for simulating BM$(\mu, \Sigma)$ provide methods
for simulating GBM$(\mu, \Sigma)$ (after replacement of $\mu_i$ by $\mu_i - \frac{1}{2}\sigma_i^2$).

The discussion of the choice of the drift parameter $\mu$ in Section 3.2.2 applies
equally well to each $\mu_i$ in pricing options on multiple underlying assets. Often,
$\mu_i = r - \delta_i$ where $r$ is the risk-free interest rate and $\delta_i$ is the dividend yield
on the $i$th asset $S_i$.
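A minimal sketch of the recursion (3.33) with $A$ taken to be the Cholesky
factor of $\Sigma$ follows. The function name, argument order, and the layout of
the returned array are assumptions made for illustration.

```python
import numpy as np


def simulate_multi_gbm(s0, mu, cov, times, n_paths, rng):
    """Simulate GBM(mu, cov); returns an array of shape (n_paths, n, d)."""
    s0 = np.asarray(s0, dtype=float)
    mu = np.asarray(mu, dtype=float)
    cov = np.asarray(cov, dtype=float)
    d = len(s0)
    a = np.linalg.cholesky(cov)           # lower triangular, a @ a.T == cov
    sig2 = np.diag(cov)                   # sigma_i^2
    t = np.concatenate(([0.0], np.asarray(times, dtype=float)))
    log_s = np.tile(np.log(s0), (n_paths, 1))
    out = np.empty((n_paths, len(times), d))
    for k in range(len(times)):
        h = t[k + 1] - t[k]
        z = rng.standard_normal((n_paths, d))
        # Row i of (z @ a.T) is sum_j A_ij Z_j, as in (3.33).
        log_s = log_s + (mu - 0.5 * sig2) * h + np.sqrt(h) * (z @ a.T)
        out[:, k, :] = np.exp(log_s)
    return out
```

Working with $\log S_i$ and exponentiating at the end is the same recursion
written additively; it avoids accumulating products of many factors.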




We list a few examples of option payoffs depending on multiple assets: 


o Spread option. A call option on the spread between two assets $S_1$, $S_2$ has
payoff

\[ ([S_1(T) - S_2(T)] - K)^+ \]

with $K$ a strike price. For example, crack spread options traded on the New
York Mercantile Exchange are options on the spread between heating oil
and crude oil futures.

o Basket option. A basket option is an option on a portfolio of underlying
assets and has a payoff of, e.g.,

\[ ([c_1 S_1(T) + c_2 S_2(T) + \cdots + c_d S_d(T)] - K)^+. \]

Typical examples would be options on a portfolio of related assets, such as
bank stocks or Asian currencies.

o Outperformance option. These are options on the maximum or minimum of
multiple assets and have payoffs of, e.g., the form

\[ (\max\{c_1 S_1(T), c_2 S_2(T), \ldots, c_d S_d(T)\} - K)^+. \]

o Barrier options. A two-asset barrier option may have a payoff of the form

\[ \mathbf{1}\left\{ \min_{i=1,\ldots,n} S_2(t_i) < b \right\} (K - S_1(T))^+. \]

This is a down-and-in put on $S_1$ that knocks in when $S_2$ drops below a
barrier at $b$. Many variations on this basic structure are possible. In this
example, one may think of $S_1$ as an individual stock and $S_2$ as the level
of an equity index: the put on the stock is knocked in only if the market
drops.

o Quantos. Quantos are options sensitive both to a stock price and an
exchange rate. For example, consider an option to buy a stock denominated
in a foreign currency with the strike price fixed in the foreign currency but
the payoff of the option to be made in the domestic currency. Let $S_1$
denote the stock price and $S_2$ the exchange rate, expressed as the quantity of
domestic currency required per unit of foreign currency. Then the payoff of
the option in the domestic currency is given by

\[ S_2(T)(S_1(T) - K)^+. \tag{3.34} \]

The payoff

\[ \left( S_1(T) - \frac{K}{S_2(T)} \right)^+ \]

corresponds to a quanto in which the level of the strike is fixed in the
domestic currency and the payoff of the option is made in the foreign currency.




Change of Numeraire 


The pricing of an option on two or more underlying assets can sometimes
be transformed to a problem with one less underlying asset (and thus to a
lower-dimensional problem) by choosing one of the assets to be the numeraire.
Consider, for example, an option to exchange a basket of assets for another
asset with payoff

\[ \left( \sum_{i=1}^{d-1} c_i S_i(T) - c_d S_d(T) \right)^+ \]

for some constants $c_i$. The price of the option is given by

\[ e^{-rT} E\left[ \left( \sum_{i=1}^{d-1} c_i S_i(T) - c_d S_d(T) \right)^+ \right], \tag{3.35} \]

the expectation taken under the risk-neutral measure. Recall that this is the
measure associated with the numeraire asset $\beta(t) = e^{rt}$ and is characterized
by the property that the processes $S_i(t)/\beta(t)$, $i = 1, \ldots, d$, are martingales
under this measure.

As explained in Section 1.2.3, choosing a different asset as numeraire, say
$S_d$, means switching to a probability measure under which the processes
$S_i(t)/S_d(t)$, $i = 1, \ldots, d - 1$, and $\beta(t)/S_d(t)$ are martingales. More precisely,
if we let $P_\beta$ denote the risk-neutral measure, the new measure $P_{S_d}$ is defined
by the likelihood ratio process (cf. Appendix B.4)

\[ \left( \frac{dP_{S_d}}{dP_\beta} \right)_t = \frac{S_d(t)}{\beta(t)} \cdot \frac{\beta(0)}{S_d(0)}. \tag{3.36} \]


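Through this change of measure, the option price (3.35) can be expressed
as

\[ e^{-rT} E_{S_d}\left[ \left( \sum_{i=1}^{d-1} c_i S_i(T) - c_d S_d(T) \right)^+
   \left( \frac{dP_\beta}{dP_{S_d}} \right)_T \right]
   = e^{-rT} E_{S_d}\left[ \left( \sum_{i=1}^{d-1} c_i S_i(T) - c_d S_d(T) \right)^+
   \frac{\beta(T) S_d(0)}{S_d(T) \beta(0)} \right] \]

\[ = S_d(0)\, E_{S_d}\left[ \left( \sum_{i=1}^{d-1} c_i \frac{S_i(T)}{S_d(T)} - c_d \right)^+ \right], \]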

with $E_{S_d}$ denoting expectation under $P_{S_d}$. From this representation it becomes
clear that only the $d - 1$ ratios $S_i(T)/S_d(T)$ (and the constant $S_d(0)$) are
needed to price this option under the new measure. We thus need to determine
the dynamics of these ratios under the new measure.

Using (3.32) and (3.36), we find that

\[ \left( \frac{dP_{S_d}}{dP_\beta} \right)_t = \exp\left( -\tfrac{1}{2}\sigma_d^2 t + a_d W(t) \right). \]

Girsanov's Theorem (see Appendix B.4) now implies that the process

\[ W^d(t) = W(t) - a_d^\top t \]

is a standard Brownian motion under $P_{S_d}$. Thus, the effect of changing
numeraire is to add a drift $a_d^\top$ to $W$. The ratio $S_i(t)/S_d(t)$ is given by

\[ \frac{S_i(t)}{S_d(t)} = \frac{S_i(0)}{S_d(0)}
   \exp\left( -\tfrac{1}{2}\sigma_i^2 t + \tfrac{1}{2}\sigma_d^2 t + (a_i - a_d) W(t) \right) \]

\[ = \frac{S_i(0)}{S_d(0)}
   \exp\left( -\tfrac{1}{2}\sigma_i^2 t + \tfrac{1}{2}\sigma_d^2 t + (a_i - a_d)(W^d(t) + a_d^\top t) \right) \]

\[ = \frac{S_i(0)}{S_d(0)}
   \exp\left( -\tfrac{1}{2}(a_i - a_d)(a_i - a_d)^\top t + (a_i - a_d) W^d(t) \right), \]

using the identities $a_j a_j^\top = \sigma_j^2$, $j = 1, \ldots, d$, from the definition of the $a_j$ in
(3.32). Under $P_{S_d}$, the scalar process $(a_i - a_d) W^d(t)$ is a Brownian motion
with drift 0 and diffusion coefficient $(a_i - a_d)(a_i - a_d)^\top$. This verifies that the
ratios $S_i/S_d$ are martingales under $P_{S_d}$ and also that $(S_1/S_d, \ldots, S_{d-1}/S_d)$
remains a multivariate geometric Brownian motion under the new measure. It
is thus possible to price the option by simulating just this $(d - 1)$-dimensional
process of ratios rather than the original $d$-dimensional process of asset prices.
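For $d = 2$ and $c_1 = c_2 = 1$ the payoff is that of an option to exchange $S_2$
for $S_1$, and the reduction leaves a single ratio to simulate. The sketch below
is an illustration under assumptions not stated in the text: no dividends, a
correlation `rho` between the two driving Brownian motions, and hypothetical
function names. The comparison value is Margrabe's closed-form exchange option
formula, which is not derived here and is used only as a check.

```python
import math
import numpy as np


def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def margrabe_price(s1, s2, sigma1, sigma2, rho, maturity):
    """Closed-form value of the option to exchange asset 2 for asset 1."""
    v = math.sqrt((sigma1**2 + sigma2**2 - 2.0 * rho * sigma1 * sigma2) * maturity)
    d1 = (math.log(s1 / s2) + 0.5 * v**2) / v
    return s1 * norm_cdf(d1) - s2 * norm_cdf(d1 - v)


def exchange_option_by_ratio(s1, s2, sigma1, sigma2, rho, maturity, n_paths, rng):
    """Estimate S_2(0) E[(S_1(T)/S_2(T) - 1)^+] under the measure with numeraire S_2."""
    # (a_1 - a_2)(a_1 - a_2)^T = sigma_1^2 + sigma_2^2 - 2 rho sigma_1 sigma_2
    var = (sigma1**2 + sigma2**2 - 2.0 * rho * sigma1 * sigma2) * maturity
    z = rng.standard_normal(n_paths)
    ratio = (s1 / s2) * np.exp(-0.5 * var + math.sqrt(var) * z)   # driftless GBM
    payoff = s2 * np.maximum(ratio - 1.0, 0.0)                    # no discounting needed
    return payoff.mean(), payoff.std(ddof=1) / math.sqrt(n_paths)
```

Only one normal variate per path is drawn, compared with two when both asset
prices are simulated under the risk-neutral measure.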

This device would not have been effective in the example above if the
payoff in (3.35) had instead been

\[ \left( \sum_{i=1}^{d} c_i S_i(T) - K \right)^+ \]

with $K$ a constant. In this case, dividing through by $S_d(T)$ would have
produced a term $K/S_d(T)$ and would thus have required simulating this ratio as
well as $S_i/S_d$, $i = 1, \ldots, d - 1$. What, then, is the scope of this method? If
the payoff of an option is given by $g(S_1(T), \ldots, S_d(T))$, then the property we
need is that $g$ be homogeneous of degree 1, meaning that

\[ g(\alpha x_1, \ldots, \alpha x_d) = \alpha\, g(x_1, \ldots, x_d) \]

for all scalars $\alpha$ and all $x_1, \ldots, x_d$. For in this case we have

\[ \frac{g(S_1(T), \ldots, S_d(T))}{S_d(T)}
   = g\left( S_1(T)/S_d(T), \ldots, S_{d-1}(T)/S_d(T), 1 \right) \]

and taking one of the underlying assets as numeraire does indeed reduce by
one the relevant number of underlying stochastic variables. See Jamshidian
[197] for a more general development of this observation.

3.3 Gaussian Short Rate Models


This section and the next develop methods for simulating some simple but
important stochastic interest rate models. These models posit the dynamics
of an instantaneous continuously compounded short rate $r(t)$. An investment
in a money market account earning interest at rate $r(u)$ at time $u$ grows from
a value of 1 at time 0 to a value of

\[ \beta(t) = \exp\left( \int_0^t r(u)\, du \right) \]

at time $t$. Though this is now a stochastic quantity, it remains the numeraire
for risk-neutral pricing. The price at time 0 of a derivative security that pays
$X$ at time $T$ is the expectation of $X/\beta(T)$, i.e.,

\[ E\left[ \exp\left( -\int_0^T r(u)\, du \right) X \right], \tag{3.37} \]

the expectation taken with respect to the risk-neutral measure. In particular,
the time-0 price of a bond paying 1 at $T$ is given by

\[ B(0, T) = E\left[ \exp\left( -\int_0^T r(u)\, du \right) \right]. \tag{3.38} \]

We focus primarily on the dynamics of the short rate under the risk-neutral
measure.

The Gaussian models treated in this section offer a high degree of tractabil- 
ity. Many simple instruments can be priced in closed form in these models or 
using deterministic numerical methods. Some extensions of the basic models 
and some pricing applications do, however, require simulation for the calcu- 
lation of expressions of the form (3.37). The tractability of the models offers 
opportunities for increasing the accuracy of simulation. 


3.3.1 Basic Models and Simulation 


The classical model of Vasicek [352] describes the short rate through an
Ornstein-Uhlenbeck process (cf. Karatzas and Shreve [207], p.358)

\[ dr(t) = \alpha(b - r(t))\, dt + \sigma\, dW(t). \tag{3.39} \]

Here, $W$ is a standard Brownian motion and $\alpha$, $b$, and $\sigma$ are positive constants.
Notice that the drift in (3.39) is positive if $r(t) < b$ and negative if $r(t) > b$;
thus, $r(t)$ is pulled toward level $b$, a property generally referred to as mean
reversion. We may interpret $b$ as a long-run interest rate level and $\alpha$ as the
speed at which $r(t)$ is pulled toward $b$. The mean-reverting form of the drift is
an essential feature of the Ornstein-Uhlenbeck process and thus of the Vasicek
model.
The continuous-time Ho-Lee model [185] has

\[ dr(t) = g(t)\, dt + \sigma\, dW(t) \tag{3.40} \]

with $g$ a deterministic function of time. Both (3.39) and (3.40) define Gaussian
processes, meaning that the joint distribution of $r(t_1), \ldots, r(t_n)$ is multivariate
normal for any $t_1, \ldots, t_n$. Both define Markov processes and are special cases
of the general Gaussian Markov process specified by

\[ dr(t) = [g(t) + h(t) r(t)]\, dt + \sigma(t)\, dW(t), \tag{3.41} \]

with $g$, $h$, and $\sigma$ all deterministic functions of time. Natural extensions of
(3.39) and (3.40) thus allow $\sigma$, $b$, and $\alpha$ to vary with time. Modeling with the
Vasicek model when $b$ in particular is time-varying is discussed in Hull and
White [190].

The SDE (3.41) has solution

\[ r(t) = e^{H(t)} r(0) + \int_0^t e^{H(t) - H(s)} g(s)\, ds
   + \int_0^t e^{H(t) - H(s)} \sigma(s)\, dW(s), \]

with

\[ H(t) = \int_0^t h(s)\, ds, \]

as can be verified through an application of Itô's formula. Because this
produces a Gaussian process, simulation of $r(t_1), \ldots, r(t_n)$ is a special case of the
general problem of sampling from a multivariate normal distribution, treated
in Section 2.3. But it is a sufficiently interesting special case to merit
consideration. To balance tractability with generality, we will focus on the Vasicek
model (3.39) with time-varying $b$ and on the Ho-Lee model (3.40). Similar
ideas apply to the general case (3.41).


Simulation 


For the Vasicek model with time-varying $b$, the general solution above
specializes to

\[ r(t) = e^{-\alpha t} r(0) + \alpha \int_0^t e^{-\alpha(t - s)} b(s)\, ds
   + \sigma \int_0^t e^{-\alpha(t - s)}\, dW(s). \tag{3.42} \]

Similarly, for any $0 < u < t$,

\[ r(t) = e^{-\alpha(t - u)} r(u) + \alpha \int_u^t e^{-\alpha(t - s)} b(s)\, ds
   + \sigma \int_u^t e^{-\alpha(t - s)}\, dW(s). \]

From this it follows that, given $r(u)$, the value $r(t)$ is normally distributed
with mean

\[ e^{-\alpha(t - u)} r(u) + \mu(u, t), \qquad
   \mu(u, t) = \alpha \int_u^t e^{-\alpha(t - s)} b(s)\, ds, \tag{3.43} \]

and variance

\[ \sigma_r^2(u, t) = \sigma^2 \int_u^t e^{-2\alpha(t - s)}\, ds
   = \frac{\sigma^2}{2\alpha} \left( 1 - e^{-2\alpha(t - u)} \right). \tag{3.44} \]

To simulate $r$ at times $0 = t_0 < t_1 < \cdots < t_n$, we may therefore set

\[ r(t_{i+1}) = e^{-\alpha(t_{i+1} - t_i)} r(t_i) + \mu(t_i, t_{i+1}) + \sigma_r(t_i, t_{i+1}) Z_{i+1}, \tag{3.45} \]

with $Z_1, \ldots, Z_n$ independent draws from $N(0, 1)$.

This algorithm is an exact simulation in the sense that the distribution
of the $r(t_1), \ldots, r(t_n)$ it produces is precisely that of the Vasicek process at
times $t_1, \ldots, t_n$ for the same value of $r(0)$. In contrast, the slightly simpler
Euler scheme

\[ r(t_{i+1}) = r(t_i) + \alpha(b(t_i) - r(t_i))(t_{i+1} - t_i) + \sigma \sqrt{t_{i+1} - t_i}\, Z_{i+1} \]

entails some discretization error. Exact simulation of the Ho-Lee process (3.40)
is a special case of the method in (3.4) for simulating a Brownian motion with
time-varying drift.

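In the special case that $b(t) \equiv b$, the algorithm in (3.45) simplifies to

\[ r(t_{i+1}) = e^{-\alpha(t_{i+1} - t_i)} r(t_i) + b \left( 1 - e^{-\alpha(t_{i+1} - t_i)} \right)
   + \sigma \sqrt{\frac{1}{2\alpha} \left( 1 - e^{-2\alpha(t_{i+1} - t_i)} \right)}\, Z_{i+1}. \tag{3.46} \]

The Euler scheme is then equivalent to making the approximation $e^x \approx 1 + x$
for the exponentials in this recursion.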
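The two recursions for constant $b$ can be placed side by side. The sketch
below uses hypothetical function names and a constant step size `dt`, both
assumptions made for illustration.

```python
import numpy as np


def vasicek_exact_step(r, alpha, b, sigma, dt, z):
    """One step of the exact recursion (3.46) with constant b."""
    decay = np.exp(-alpha * dt)
    std = sigma * np.sqrt((1.0 - decay**2) / (2.0 * alpha))
    return decay * r + b * (1.0 - decay) + std * z


def vasicek_euler_step(r, alpha, b, sigma, dt, z):
    """One Euler step: exp(-alpha*dt) replaced by 1 - alpha*dt, and so on."""
    return r + alpha * (b - r) * dt + sigma * np.sqrt(dt) * z


def simulate_vasicek(r0, alpha, b, sigma, dt, n_steps, n_paths, rng,
                     step=vasicek_exact_step):
    """Return r(n_steps * dt) for n_paths independent paths."""
    r = np.full(n_paths, float(r0))
    for _ in range(n_steps):
        r = step(r, alpha, b, sigma, dt, rng.standard_normal(n_paths))
    return r
```

With the exact step the distribution of the output does not depend on how many
steps are used to reach a given horizon; with the Euler step it does, and the
two agree only as `dt` shrinks.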

Evaluation of the integral defining $\mu(t_i, t_{i+1})$ and required in (3.45) may
seem burdensome. The effort involved in evaluating this integral clearly
depends on the form of the function $b(t)$ so it is worth discussing how this
function is likely to be specified in practice. Typically, the flexibility to make
$b$ vary with time is used to make the dynamics of the short rate consistent
with an observed term structure of bond prices. The same is true of the
function $g$ in the Ho-Lee model (3.40). We return to this point in Section 3.3.2,
where we discuss bond prices in Gaussian models.


Stationary Version 
Suppose $b(t) \equiv b$ and $\alpha > 0$. Then from (3.43) we see that

\[ E[r(t)] = e^{-\alpha t} r(0) + (1 - e^{-\alpha t}) b \to b \quad \text{as } t \to \infty, \]

so the process $r(t)$ has a limiting mean. It also has a limiting variance given
(via (3.44)) by

\[ \lim_{t \to \infty} \mathrm{Var}[r(t)]
   = \lim_{t \to \infty} \frac{\sigma^2}{2\alpha} \left( 1 - e^{-2\alpha t} \right)
   = \frac{\sigma^2}{2\alpha}. \]

In fact, $r(t)$ converges in distribution to a normal distribution with this mean
and variance, in the sense that for any $x \in \mathbb{R}$

\[ P(r(t) \le x) \to \Phi\left( \frac{x - b}{\sigma/\sqrt{2\alpha}} \right), \]

with $\Phi$ the standard normal distribution. The fact that $r(t)$ has a limiting
distribution is a reflection of the stabilizing effect of mean reversion in the drift
and contrasts with the long-run behavior of, for example, geometric Brownian
motion.

The limiting distribution of $r(t)$ is also a stationary distribution in the
sense that if $r(0)$ is given this distribution then every $r(t)$, $t > 0$, has this
distribution as well. Because (3.46) provides an exact discretization of the
process, the $N(b, \sigma^2/2\alpha)$ distribution is also stationary for the discretized
process. To simulate a stationary version of the process, it therefore suffices
to draw $r(0)$ from this normal distribution and then proceed as in (3.46).
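A stationary version can be sketched by drawing $r(0)$ from $N(b, \sigma^2/2\alpha)$ and
then applying the exact recursion. The function name and the equally spaced
time grid are assumptions made for illustration.

```python
import numpy as np


def simulate_stationary_vasicek(alpha, b, sigma, dt, n_steps, n_paths, rng):
    """Return an (n_paths, n_steps + 1) array of r(0), r(dt), ..., r(n_steps*dt)."""
    # Draw r(0) from the stationary distribution N(b, sigma^2 / (2 alpha)).
    r = b + sigma / np.sqrt(2.0 * alpha) * rng.standard_normal(n_paths)
    decay = np.exp(-alpha * dt)
    std = sigma * np.sqrt((1.0 - decay**2) / (2.0 * alpha))
    out = np.empty((n_paths, n_steps + 1))
    out[:, 0] = r
    for i in range(n_steps):
        r = decay * r + b * (1.0 - decay) + std * rng.standard_normal(n_paths)
        out[:, i + 1] = r
    return out
```

Every column of the output then has the same mean and variance, up to sampling
error, whatever the step size.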


3.3.2 Bond Prices 


As already noted, time-dependent drift parameters are typically used to make
a short rate model consistent with an observed set of bond prices.
Implementation of the simulation algorithm (3.45) is thus linked to the calibration of the
model through the choice of the function $b(t)$. The same applies to the
function $g(t)$ in the Ho-Lee model and as this case is slightly simpler we consider
it first.

Our starting point is the bond-pricing formula (3.38). The integral of $r(u)$
from 0 to $T$ appearing in that formula is normally distributed because $r(u)$
is a Gaussian process. It follows that the bond price is the expectation of
the exponential of a normal random variable. For a normal random variable
$X \sim N(m, v^2)$, we have $E[\exp(X)] = \exp(m + (v^2/2))$, so

\[ E\left[ \exp\left( -\int_0^T r(t)\, dt \right) \right]
   = \exp\left( -E\left[ \int_0^T r(t)\, dt \right]
   + \tfrac{1}{2} \mathrm{Var}\left[ \int_0^T r(t)\, dt \right] \right). \tag{3.47} \]

To find the price of the bond we therefore need to find the mean and variance
of the integral of the short rate.
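In the Ho-Lee model, the short rate is given by

\[ r(t) = r(0) + \int_0^t g(s)\, ds + \sigma W(t) \]

and its integral by

\[ \int_0^T r(u)\, du = r(0) T + \int_0^T \int_0^u g(s)\, ds\, du + \sigma \int_0^T W(u)\, du. \]

This integral has mean

\[ r(0) T + \int_0^T \int_0^u g(s)\, ds\, du \]

and variance

\[ \mathrm{Var}\left[ \sigma \int_0^T W(u)\, du \right]
   = 2\sigma^2 \int_0^T \int_0^t \mathrm{Cov}[W(u), W(t)]\, du\, dt
   = 2\sigma^2 \int_0^T \int_0^t u\, du\, dt
   = \tfrac{1}{3} \sigma^2 T^3. \tag{3.48} \]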


Substituting these expressions in (3.47), we get

\[ B(0, T) = E\left[ \exp\left( -\int_0^T r(u)\, du \right) \right]
   = \exp\left( -r(0) T - \int_0^T \int_0^u g(s)\, ds\, du + \frac{\sigma^2 T^3}{6} \right). \]

If we are given a set of bond prices $B(0, T)$ at time 0, our objective is to
choose the function $g$ so that this equation holds.
To carry this out we can write

\[ B(0, T) = \exp\left( -\int_0^T f(0, t)\, dt \right), \]

with $f(0, t)$ the instantaneous forward rate for time $t$ as of time 0 (cf.
Appendix C). The initial forward curve $f(0, T)$ captures the same information as
the initial bond prices. Equating the two expressions for $B(0, T)$ and taking
logarithms, we find that

\[ r(0) T + \int_0^T \int_0^u g(s)\, ds\, du - \frac{\sigma^2 T^3}{6} = \int_0^T f(0, t)\, dt. \]

Differentiating twice with respect to the maturity argument $T$, we find that

\[ g(t) = \left. \frac{\partial}{\partial T} f(0, T) \right|_{T = t} + \sigma^2 t. \tag{3.49} \]

Thus, bond prices produced by the Ho-Lee model will match a given set of
bond prices $B(0, T)$ if the function $g$ is tied to the initial forward curve $f(0, T)$
in this way; i.e., if we specify

\[ dr(t) = \left( \left. \frac{\partial}{\partial T} f(0, T) \right|_{T = t} + \sigma^2 t \right) dt + \sigma\, dW(t). \tag{3.50} \]


A generic simulation of the Ho-Lee model with drift function $g$ can be
written as

\[ r(t_{i+1}) = r(t_i) + \int_{t_i}^{t_{i+1}} g(s)\, ds + \sigma \sqrt{t_{i+1} - t_i}\, Z_{i+1}, \]

with $Z_1, Z_2, \ldots$ independent $N(0, 1)$ random variables. With $g$ chosen as in
(3.49), this simplifies to

\[ r(t_{i+1}) = r(t_i) + [f(0, t_{i+1}) - f(0, t_i)] + \frac{\sigma^2}{2} [t_{i+1}^2 - t_i^2]
   + \sigma \sqrt{t_{i+1} - t_i}\, Z_{i+1}. \]

Thus, no integration of the drift function $g$ is necessary; to put it another
way, whatever integration is necessary must already have been dealt with in
choosing the forward curve $f(0, t)$ to match a set of bond prices.
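The simplified recursion needs only the forward curve evaluated on the time
grid. In the sketch below, `fwd` is a hypothetical callable returning $f(0, t)$,
the function name is illustrative, and $r(0)$ is set equal to $f(0, 0)$.

```python
import numpy as np


def simulate_ho_lee(fwd, sigma, times, n_paths, rng):
    """Ho-Lee short rate fitted to the forward curve fwd; returns (n_paths, n)."""
    t = np.concatenate(([0.0], np.asarray(times, dtype=float)))
    f = np.array([fwd(u) for u in t])                 # f(0, t_0), ..., f(0, t_n)
    # Integral of g over each step: forward-rate change plus (sigma^2/2)(t_{i+1}^2 - t_i^2).
    drift = np.diff(f) + 0.5 * sigma**2 * np.diff(t**2)
    z = rng.standard_normal((n_paths, len(times)))
    increments = drift + sigma * np.sqrt(np.diff(t)) * z
    return f[0] + np.cumsum(increments, axis=1)       # r(0) = f(0, 0)
```

Summing the drift terms shows that the simulated short rate has mean
$f(0, t_i) + \sigma^2 t_i^2/2$ at each grid date, which is easy to check against the
sample mean.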

The situation is even simpler if we require that our simulated short rate
be consistent only with bonds maturing at the simulation times $t_1, \ldots, t_n$. To
satisfy this requirement we can weaken (3.49) to the condition that

\[ \int_{t_i}^{t_{i+1}} g(s)\, ds = f(0, t_{i+1}) - f(0, t_i) + \frac{\sigma^2}{2} [t_{i+1}^2 - t_i^2]. \]

Except for this constraint, the choice of $g$ is immaterial; we could take it to
be continuous and piecewise linear, for example. In fact, we never even need
to specify $g$ because only its integrals over the intervals $(t_i, t_{i+1})$ influence the
values of $r$ on the time grid $t_1, \ldots, t_n$.


Bonds in the Vasicek Model 


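A similar if less explicit solution applies to the Vasicek model. The integral
of the short rate is again normally distributed; we need to find the mean and
variance of this integral to find the price of a bond using (3.47). Using (3.42),
for the mean we get

\[ E\left[ \int_0^T r(t)\, dt \right] = \int_0^T E[r(t)]\, dt
   = \frac{1}{\alpha} \left( 1 - e^{-\alpha T} \right) r(0)
   + \alpha \int_0^T \int_0^t e^{-\alpha(t - s)} b(s)\, ds\, dt. \tag{3.51} \]

For the variance we have

\[ \mathrm{Var}\left[ \int_0^T r(t)\, dt \right]
   = 2 \int_0^T \int_0^t \mathrm{Cov}[r(t), r(u)]\, du\, dt. \tag{3.52} \]

From (3.42) we get, for $u \le t$,

\[ \mathrm{Cov}[r(t), r(u)] = \sigma^2 \int_0^u e^{-\alpha(t - s)} e^{-\alpha(u - s)}\, ds
   = \frac{\sigma^2}{2\alpha} \left( e^{\alpha(u - t)} - e^{-\alpha(u + t)} \right). \tag{3.53} \]

Integrating two more times as required for (3.52) gives

\[ \mathrm{Var}\left[ \int_0^T r(t)\, dt \right]
   = \frac{\sigma^2}{\alpha^2} \left[ T + \frac{1}{2\alpha} \left( 1 - e^{-2\alpha T} \right)
   + \frac{2}{\alpha} \left( e^{-\alpha T} - 1 \right) \right]. \tag{3.54} \]

By combining (3.51) and (3.54) as in (3.47), we arrive at an expression for the
bond price $B(0, T)$.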
Observe that (3.54) does not depend on $r(0)$ and (3.51) is a linear
transformation of $r(0)$. If we set

\[ A(t, T) = \frac{1}{\alpha} \left( 1 - e^{-\alpha(T - t)} \right) \]

and

\[ C(t, T) = -\alpha \int_t^T \int_t^u e^{-\alpha(u - s)} b(s)\, ds\, du
   + \frac{\sigma^2}{2\alpha^2} \left[ (T - t) + \frac{1}{2\alpha} \left( 1 - e^{-2\alpha(T - t)} \right)
   + \frac{2}{\alpha} \left( e^{-\alpha(T - t)} - 1 \right) \right], \]

then substituting (3.51) and (3.54) in (3.47) produces

\[ B(0, T) = \exp(-A(0, T) r(0) + C(0, T)). \]

In fact, the same calculations show that

\[ B(t, T) = \exp(-A(t, T) r(t) + C(t, T)). \tag{3.55} \]

In particular, $\log B(t, T)$ is a linear transformation of $r(t)$. This feature has
been generalized by Brown and Schaefer [71] and Duffie and Kan [101] to what
is generally referred to as the affine class of interest rate models.
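For constant $b$ the double integral in $C(t, T)$ can be evaluated explicitly:
the inner integral equals $(1 - e^{-\alpha(u - t)})/\alpha$ times $b$, so the first term
becomes $-b[(T - t) - A(t, T)]$. The following sketch of (3.55) relies on that
simplification; the function name is hypothetical and a constant $b$ is an
assumption not required by the formula in general.

```python
import math


def vasicek_bond_price(r_t, alpha, b, sigma, tau):
    """B(t, T) from (3.55) for constant b, with tau = T - t and r_t = r(t)."""
    a_fun = (1.0 - math.exp(-alpha * tau)) / alpha            # A(t, T)
    c_fun = -b * (tau - a_fun) + sigma**2 / (2.0 * alpha**2) * (
        tau
        + (1.0 - math.exp(-2.0 * alpha * tau)) / (2.0 * alpha)
        + 2.0 * (math.exp(-alpha * tau) - 1.0) / alpha)       # C(t, T)
    return math.exp(-a_fun * r_t + c_fun)
```

Two quick checks follow from the formula: the price is 1 at `tau = 0`, and with
`sigma = 0` and `r_t = b` it reduces to $e^{-b(T - t)}$.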

As in our discussion of the Ho-Lee model, the function $b(s)$ can be chosen
to match a set of prices $B(0, T)$ indexed by $T$. If we are concerned only with
matching a finite set of bond prices $B(0, t_1), \ldots, B(0, t_n)$, then only the values
of the integrals

\[ \int_{t_i}^{t_{i+1}} e^{-\alpha(t_{i+1} - s)} b(s)\, ds \]

need to be specified. These are, up to the constant factor $\alpha$, precisely the
terms $\mu(t_i, t_{i+1})$ needed in the simulation algorithm (3.45). Thus, these
integrals are by-products of fitting the model to a term structure and not
additional computations required solely for the simulation.

Joint Simulation with the Discount Factor


Most applications that call for simulation of a short rate process $r(t)$ also
require values of the discount factor

\[ \frac{1}{\beta(t)} = \exp\left( -\int_0^t r(u)\, du \right) \]

or, equivalently, of

\[ Y(t) = \int_0^t r(u)\, du. \]

Given values $r(0), r(t_1), \ldots, r(t_n)$ of the short rate, one can of course generate
approximate values of $Y(t_i)$ using

\[ \sum_{j=1}^{i} r(t_{j-1}) [t_j - t_{j-1}], \quad t_0 = 0, \]

or some other approximation to the time integral. But in a Gaussian model,
the pair $(r(t), Y(t))$ are jointly Gaussian and it is often possible to simulate
paths of the pair without discretization error. To carry this out we simply
need to find the means, variances, and covariance of the increments of $r(t)$
and $Y(t)$.
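We have already determined (see (3.45)) that, given $r(t_i)$,

\[ r(t_{i+1}) \sim N\left( e^{-\alpha(t_{i+1} - t_i)} r(t_i) + \mu(t_i, t_{i+1}),\ \sigma_r^2(t_i, t_{i+1}) \right). \]

From the same calculations used in (3.51) and (3.54), we find that, given $r(t_i)$
and $Y(t_i)$,

\[ Y(t_{i+1}) \sim N\left( Y(t_i) + \mu_Y(t_i, t_{i+1}),\ \sigma_Y^2(t_i, t_{i+1}) \right), \]

with

\[ \mu_Y(t_i, t_{i+1}) = \frac{1}{\alpha} \left( 1 - e^{-\alpha(t_{i+1} - t_i)} \right) r(t_i)
   + \alpha \int_{t_i}^{t_{i+1}} \int_{t_i}^{u} e^{-\alpha(u - s)} b(s)\, ds\, du \]

and

\[ \sigma_Y^2(t_i, t_{i+1}) = \frac{\sigma^2}{\alpha^2} \left( (t_{i+1} - t_i)
   + \frac{1}{2\alpha} \left( 1 - e^{-2\alpha(t_{i+1} - t_i)} \right)
   + \frac{2}{\alpha} \left( e^{-\alpha(t_{i+1} - t_i)} - 1 \right) \right). \]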


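It only remains to determine the conditional covariance between $r(t_{i+1})$
and $Y(t_{i+1})$ given $(r(t_i), Y(t_i))$. For this we proceed as follows:

\[ \mathrm{Cov}[r(t), Y(t)] = \int_0^t \mathrm{Cov}[r(t), r(u)]\, du
   = \frac{\sigma^2}{2\alpha} \int_0^t \left( e^{\alpha(u - t)} - e^{-\alpha(u + t)} \right) du
   = \frac{\sigma^2}{2\alpha^2} \left[ 1 + e^{-2\alpha t} - 2 e^{-\alpha t} \right]. \]

The required covariance is thus given by

\[ \sigma_{rY}(t_i, t_{i+1}) = \frac{\sigma^2}{2\alpha^2}
   \left[ 1 + e^{-2\alpha(t_{i+1} - t_i)} - 2 e^{-\alpha(t_{i+1} - t_i)} \right]. \]

The corresponding correlation is

\[ \rho_{rY}(t_i, t_{i+1}) = \frac{\sigma_{rY}(t_i, t_{i+1})}{\sigma_r(t_i, t_{i+1})\, \sigma_Y(t_i, t_{i+1})}. \]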

With this notation, the pair $(r, Y)$ can be simulated at times $t_1, \ldots, t_n$ without
discretization error using the following algorithm:

\[ r(t_{i+1}) = e^{-\alpha(t_{i+1} - t_i)} r(t_i) + \mu(t_i, t_{i+1}) + \sigma_r(t_i, t_{i+1}) Z_1(i + 1) \]

\[ Y(t_{i+1}) = Y(t_i) + \mu_Y(t_i, t_{i+1}) + \sigma_Y(t_i, t_{i+1})
   \left[ \rho_{rY}(t_i, t_{i+1}) Z_1(i + 1) + \sqrt{1 - \rho_{rY}^2(t_i, t_{i+1})}\, Z_2(i + 1) \right], \]

where $(Z_1(i), Z_2(i))$, $i = 1, \ldots, n$, are independent standard bivariate normal
random vectors.
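The joint recursion can be sketched as follows for constant $b$ and an equally
spaced grid, in which case $\mu(t_i, t_{i+1}) = b(1 - e^{-\alpha h})$ and the double
integral in $\mu_Y$ equals $b[h - (1 - e^{-\alpha h})/\alpha]$ with $h = t_{i+1} - t_i$. Both
simplifications, and the function name, are assumptions made for illustration.

```python
import math
import numpy as np


def simulate_r_and_y(r0, alpha, b, sigma, dt, n_steps, n_paths, rng):
    """Exact joint simulation of (r(T), Y(T)), T = n_steps * dt, constant b."""
    decay = math.exp(-alpha * dt)
    a_fun = (1.0 - decay) / alpha
    var_r = sigma**2 * (1.0 - decay**2) / (2.0 * alpha)
    var_y = sigma**2 / alpha**2 * (
        dt + (1.0 - decay**2) / (2.0 * alpha) + 2.0 * (decay - 1.0) / alpha)
    cov_ry = sigma**2 / (2.0 * alpha**2) * (1.0 + decay**2 - 2.0 * decay)
    rho = cov_ry / math.sqrt(var_r * var_y)
    r = np.full(n_paths, float(r0))
    y = np.zeros(n_paths)
    for _ in range(n_steps):
        z1 = rng.standard_normal(n_paths)
        z2 = rng.standard_normal(n_paths)
        # The mean of the Y increment uses r(t_i), so update Y before r.
        y = y + a_fun * r + b * (dt - a_fun) + math.sqrt(var_y) * (
            rho * z1 + math.sqrt(1.0 - rho**2) * z2)
        r = decay * r + b * (1.0 - decay) + math.sqrt(var_r) * z1
    return r, y
```

Averaging `np.exp(-y)` over paths then estimates the bond price $B(0, T)$ without
discretization bias, which can be compared with the closed form from (3.55).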


Change of Numeraire 


Thus far, we have considered the dynamics of the short rate $r(t)$ only under
the risk-neutral measure. Recall that the numeraire asset associated with the
risk-neutral measure is $\beta(t) = \exp(\int_0^t r(u)\, du)$ and the defining feature of this
probability measure is that it makes the discounted bond prices $B(t, T)/\beta(t)$
martingales. In fact, the dynamics of the bond prices under the Gaussian
models we have considered are of the form (for fixed $T$)

\[ \frac{dB(t, T)}{B(t, T)} = r(t)\, dt - A(t, T) \sigma\, dW(t) \tag{3.56} \]

with $A(t, T)$ deterministic; this follows from (3.55). The solution of this
equation is

\[ B(t, T) = B(0, T) \exp\left( \int_0^t \left[ r(u) - \tfrac{1}{2}\sigma^2 A^2(u, T) \right] du
   - \sigma \int_0^t A(u, T)\, dW(u) \right), \]

from which it is evident that

\[ \frac{B(t, T)}{\beta(t)} = B(0, T) \exp\left( -\tfrac{1}{2}\sigma^2 \int_0^t A^2(u, T)\, du
   - \sigma \int_0^t A(u, T)\, dW(u) \right) \tag{3.57} \]

is an exponential martingale.

As discussed in Section 1.2.3, the forward measure for any date $T_F$ is the
measure associated with taking the $T_F$-maturity bond $B(t, T_F)$ as numeraire
asset. The defining feature of the forward measure is that it makes the ratios
$B(t, T)/B(t, T_F)$ martingales for $T < T_F$. It is defined by the likelihood ratio
process

\[ \left( \frac{dP_{T_F}}{dP_\beta} \right)_t = \frac{B(t, T_F)\, \beta(0)}{\beta(t)\, B(0, T_F)}, \]

and this is given in (3.57) up to a factor of $1/B(0, T_F)$. From Girsanov's
Theorem, it follows that the process $W^{T_F}$ defined by

\[ dW^{T_F}(t) = dW(t) + \sigma A(t, T_F)\, dt \]

is a standard Brownian motion under $P_{T_F}$. Accordingly, the dynamics of the
Vasicek model become

\[
\begin{aligned}
dr(t) &= \alpha(b(t) - r(t))\,dt + \sigma\,dW(t) \\
 &= \alpha(b(t) - r(t))\,dt + \sigma\left(dW^{T_F}(t) - \sigma A(t,T_F)\,dt\right) \\
 &= \alpha\left(b(t) - \frac{\sigma^2}{\alpha}A(t,T_F) - r(t)\right)dt + \sigma\,dW^{T_F}(t).
\end{aligned} \tag{3.58}
\]

Thus, under the forward measure, the short rate process remains a Vasicek
process but the reversion level b(t) becomes $b(t) - \sigma^2 A(t,T_F)/\alpha$. (The term
subtracted from the drift is $\sigma^2 A(t,T_F)$; the division by $\alpha$ appears only
because the reversion level is multiplied by $\alpha$ in the drift.)

The process in (3.58) can be simulated using (3.45) with b(t) replaced by
$b(t) - \sigma^2 A(t,T_F)/\alpha$. In particular, we simulate $W^{T_F}$ the way we would
simulate any other standard Brownian motion. The simulation algorithm does not
“know” that it is simulating a Brownian motion under the forward measure
rather than under the risk-neutral measure.

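Suppose we want to price a derivative security making a payoff of $g(r(T_F))$
at time $T_F$. Under the risk-neutral measure, we would price the security by
computing

\[
E\left[e^{-\int_0^{T_F} r(u)\,du}\, g(r(T_F))\right].
\]

In fact, g could be a function of the path of r(t) rather than just its terminal
value. Switching to the forward measure, this becomes

\[
\begin{aligned}
E_{T_F}\left[e^{-\int_0^{T_F} r(u)\,du}\, g(r(T_F)) \left(\frac{dP_\beta}{dP_{T_F}}\right)_{T_F}\right]
 &= E_{T_F}\left[e^{-\int_0^{T_F} r(u)\,du}\, g(r(T_F))\, \frac{\beta(T_F)\,B(0,T_F)}{B(T_F,T_F)\,\beta(0)}\right] \\
 &= B(0,T_F)\,E_{T_F}\left[g(r(T_F))\right],
\end{aligned}
\]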




where $E_{T_F}$ denotes expectation under the forward measure. Thus, we may
price the derivative security by simulating r(t) under the forward measure
$P_{T_F}$, estimating the expectation of $g(r(T_F))$ and multiplying by $B(0,T_F)$.
Notice that discounting in this case is deterministic: we do not need to
simulate a discount factor. This apparent simplification results from inclusion
of the additional term $-\sigma^2 A(t,T_F)$ in the drift of r(t).

A consequence of working under the forward measure is that the simulation
prices the bond maturing at $T_F$ exactly: pricing this bond corresponds to
taking $g(r(T_F)) \equiv 1$. Again, this apparent simplification is really a consequence
of the form of the drift of r(t) under the forward measure.
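
As an illustration of simulating under the forward measure, the sketch below assumes a constant reversion level b, for which A(t, T) = (1 - e^{-alpha(T-t)})/alpha, and uses the fact that the forward-measure drift of r(t) is alpha(b - r(t)) - sigma^2 A(t, T_F). Under these assumptions the one-step conditional mean is available in closed form. The function names are hypothetical and the code uses only the standard library.

```python
import math
import random

def forward_mean_shift(alpha, b, sigma, s, t, t_f):
    """Integral over [s, t] of exp(-alpha (t-u)) [alpha b - sigma^2 A(u, t_f)] du,
    with A(u, T) = (1 - exp(-alpha (T-u))) / alpha (constant b assumed)."""
    dt = t - s
    level = b * (1.0 - math.exp(-alpha * dt))
    adj = (sigma / alpha) ** 2 * (
        (1.0 - math.exp(-alpha * dt))
        - 0.5 * (math.exp(-alpha * (t_f - t)) - math.exp(-alpha * (t_f + t - 2.0 * s)))
    )
    return level - adj

def simulate_r_forward(r0, alpha, b, sigma, times, t_f, rng):
    """Exact simulation of r at the last grid date under the T_F-forward measure."""
    r, s = r0, 0.0
    for t in times:
        dt = t - s
        sd = sigma * math.sqrt((1.0 - math.exp(-2.0 * alpha * dt)) / (2.0 * alpha))
        r = (math.exp(-alpha * dt) * r
             + forward_mean_shift(alpha, b, sigma, s, t, t_f)
             + sd * rng.gauss(0.0, 1.0))
        s = t
    return r

def price_under_forward_measure(g, bond_price_0_tf, r0, alpha, b, sigma,
                                times, n_paths, rng):
    """Estimate B(0, T_F) * E_{T_F}[g(r(T_F))]; times[-1] is taken to be T_F."""
    t_f = times[-1]
    total = sum(g(simulate_r_forward(r0, alpha, b, sigma, times, t_f, rng))
                for _ in range(n_paths))
    return bond_price_0_tf * total / n_paths
```

With g identically 1 the estimator returns B(0, T_F) with no sampling error, which mirrors the remark that the bond maturing at T_F is priced exactly. With g(r) = r, the estimate divided by B(0, T_F) should approach the instantaneous forward rate for date T_F.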


3.3.3 Multifactor Models 
A general class of Gaussian Markov processes in $\mathbb{R}^d$ has the form

\[
dX(t) = C(b - X(t))\,dt + D\,dW(t) \tag{3.59}
\]

where C and D are d × d matrices, b and X(t) are in $\mathbb{R}^d$, W is a standard
d-dimensional Brownian motion, and X(0) is Gaussian or constant. Such a
process remains Gaussian and Markovian if the coefficients C, b, and D are
made time-varying but deterministic. The solution of (3.59) is

\[
X(t) = e^{-Ct}X(0) + \int_0^t e^{-C(t-s)}\,Cb\,ds + \int_0^t e^{-C(t-s)}D\,dW(s),
\]

from which it is possible to define an exact time-discretization similar to (3.45).

A model of the short rate process can be specified by setting $r(t) = a^\top X(t)$
with $a \in \mathbb{R}^d$ (or with a deterministically time-varying). The elements of X(t)
are then interpreted as “factors” driving the evolution of the short rate.
Because each X(t) is normally distributed, r(t) is normally distributed. However,
r(t) is not in general a Markov process: to make the future evolution of r
independent of the past, we need to condition on the full state information X(t)
and not merely r(t).

Recall from (3.55) that in the Vasicek model (with constant or time-varying
coefficients), bond prices are exponentials of affine functions of the short rate.
A similar representation applies if the short rate has the form $r(t) = a^\top X(t)$
and X(t) is as in (3.59); in particular, we have

\[
B(t,T) = \exp\left(-A(t,T)^\top X(t) + C(t,T)\right)
\]

for some $\mathbb{R}^d$-valued function A(t, T) and some scalar function C(t, T). In the
single-factor setting, differentiating (3.55) and then simplifying leads to

\[
\frac{dB(t,T)}{B(t,T)} = r(t)\,dt - A(t,T)\,\sigma\,dW(t),
\]
with σ the diffusion parameter of r(t). The instantaneous correlation between
the returns on bonds with maturities $T_1$ and $T_2$ is therefore

\[
\frac{A(t,T_1)\sigma \cdot A(t,T_2)\sigma}{\sqrt{A^2(t,T_1)\sigma^2}\,\sqrt{A^2(t,T_2)\sigma^2}} = 1.
\]

In other words, all bonds are instantaneously perfectly correlated. In the
multifactor setting, the bond price dynamics are given by

\[
\frac{dB(t,T)}{B(t,T)} = r(t)\,dt - A(t,T)^\top D\,dW(t).
\]


The instantaneous correlation for maturities $T_1$ and $T_2$ is

\[
\frac{A(t,T_1)^\top D D^\top A(t,T_2)}{\|A(t,T_1)^\top D\|\;\|A(t,T_2)^\top D\|},
\]

which can certainly take values other than 1. The flexibility to capture less
than perfect instantaneous correlation between bond returns is the primary
motivation for considering multifactor models.
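
A small numerical illustration of this correlation formula follows. It rests on assumptions not made in the text: C is taken to be diagonal with entries alpha_j and the short rate is the plain sum of the factors, in which case the loading vector has components A_j(t, T) = (1 - e^{-alpha_j (T-t)})/alpha_j. The function names are hypothetical.

```python
import math

def loading(alpha_j, tau):
    """A_j(t, T) for time to maturity tau, assuming diagonal C and r = sum of factors."""
    return (1.0 - math.exp(-alpha_j * tau)) / alpha_j

def bond_return_correlation(alphas, D, tau1, tau2):
    """Instantaneous correlation A1' D D' A2 / (|A1' D| |A2' D|)."""
    d = len(alphas)
    a1 = [loading(a, tau1) for a in alphas]
    a2 = [loading(a, tau2) for a in alphas]
    # Row vectors A' D for each maturity.
    v1 = [sum(a1[i] * D[i][k] for i in range(d)) for k in range(d)]
    v2 = [sum(a2[i] * D[i][k] for i in range(d)) for k in range(d)]
    dot = sum(x * y for x, y in zip(v1, v2))
    norm1 = math.sqrt(sum(x * x for x in v1))
    norm2 = math.sqrt(sum(y * y for y in v2))
    return dot / (norm1 * norm2)
```

With a single factor the function returns 1 for any pair of maturities; with two factors reverting at different speeds, the correlation between a short and a long bond falls below 1.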

Returning to the general formulation in (3.59), suppose that C can be
diagonalized in the sense that $VCV^{-1} = \Lambda$ for some matrix V and diagonal
matrix Λ with diagonal entries $\lambda_1,\ldots,\lambda_d$. Suppose further that C is
nonsingular and define Y(t) = V X(t). Then

\[
\begin{aligned}
dY(t) &= V\,dX(t) \\
 &= V[C(b - X(t))\,dt + D\,dW(t)] \\
 &= (VCb - \Lambda Y(t))\,dt + VD\,dW(t) \\
 &= \Lambda(\Lambda^{-1}VCb - Y(t))\,dt + VD\,dW(t) \\
 &= \Lambda(Vb - Y(t))\,dt + VD\,dW(t) \\
 &\equiv \Lambda(\tilde b - Y(t))\,dt + d\tilde W(t)
\end{aligned}
\]

with $\tilde b = Vb$ and $\tilde W$ a BM(0, Σ) process, $\Sigma = VDD^\top V^\top$. It follows that the
components of $(Y_1,\ldots,Y_d)$ satisfy

\[
dY_j(t) = \lambda_j(\tilde b_j - Y_j(t))\,dt + d\tilde W_j(t), \quad j = 1,\ldots,d. \tag{3.60}
\]


In particular, each $Y_j$ is itself a Markov process. The $Y_j$ remain coupled,
however, through the correlation across the components of $\tilde W$. They can be
simulated as in (3.46) by setting

\[
Y_j(t_{i+1}) = e^{-\lambda_j(t_{i+1}-t_i)}\,Y_j(t_i) + \left(1 - e^{-\lambda_j(t_{i+1}-t_i)}\right)\tilde b_j
 + \sqrt{\frac{1}{2\lambda_j}\left(1 - e^{-2\lambda_j(t_{i+1}-t_i)}\right)}\;\xi_j(i+1),
\]

where ξ(1), ξ(2), ... are independent N(0, Σ) random vectors,
$\xi(i) = (\xi_1(i),\ldots,\xi_d(i))^\top$. Thus, when C is nonsingular and diagonalizable,
simulation of (3.59) can be reduced to a system of scalar simulations.




As noted by Andersen and Andreasen [14], a similar reduction is possible
even if C is not diagonalizable, but at the expense of making all coefficients
time-dependent. If V(t) is a deterministic d × d matrix-valued function of time
and we set Y(t) = V(t)X(t), then

\[
\begin{aligned}
dY(t) &= \dot V(t)X(t)\,dt + V(t)\,dX(t) \\
 &= [\dot V(t)X(t) + V(t)C(b - X(t))]\,dt + V(t)D\,dW(t),
\end{aligned}
\]

where $\dot V(t)$ denotes the time derivative of V(t). If we choose
$V(t) = \exp([C - I]t)$, then

\[
\dot V(t) = V(t)C - V(t)
\]

and thus

\[
\begin{aligned}
dY(t) &= [V(t)Cb - V(t)X(t)]\,dt + V(t)D\,dW(t) \\
 &= (\tilde b(t) - Y(t))\,dt + \tilde D(t)\,dW(t),
\end{aligned} \tag{3.61}
\]

with $\tilde b(t) = V(t)Cb$ and $\tilde D(t) = V(t)D$. Notice that the drift of each
component $Y_j(t)$ depends only on that $Y_j(t)$. This transformation therefore
decouples the drifts of the components of the state vector, making each $Y_j$ a
Markov process, though the components remain linked through the diffusion
term. We can recover the original state vector by setting $X(t) = V(t)^{-1}Y(t)$
because V(t) is always invertible. The seemingly special form of the dynamics
in (3.61) is thus no less general than the dynamics in (3.59) with time-varying
coefficients.


3.4 Square-Root Diffusions 
Feller [118] studied a class of processes that includes the square-root diffusion

\[
dr(t) = \alpha(b - r(t))\,dt + \sigma\sqrt{r(t)}\,dW(t), \tag{3.62}
\]

with W a standard one-dimensional Brownian motion. We consider the case
in which α and b are positive. If r(0) > 0, then r(t) will never be negative; if
$2\alpha b \ge \sigma^2$, then r(t) remains strictly positive for all t, almost surely.

This process was proposed by Cox, Ingersoll, and Ross [91] as a model
of the short rate, generally referred to as the CIR model. They developed a
general equilibrium framework in which if the change in production
opportunities is assumed to follow a process of this form, then the short rate
does as well. As with the Vasicek model, the form of the drift in (3.62) suggests
that r(t) is pulled towards b at a speed controlled by α. In contrast to the Vasicek
model, in the CIR model the diffusion term $\sigma\sqrt{r(t)}$ decreases to zero as r(t)
approaches the origin and this prevents r(t) from taking negative values. This
feature of (3.62) is attractive in modeling interest rates.

All of the coefficients in (3.62) could in principle be made time-dependent.
In practice, it can be particularly useful to replace the constant b with a
function of time and thus consider

\[
dr(t) = \alpha(b(t) - r(t))\,dt + \sigma\sqrt{r(t)}\,dW(t). \tag{3.63}
\]


As with the Vasicek model, this extension is frequently used to make the bond
price function

\[
T \mapsto E\left[\exp\left(-\int_0^T r(u)\,du\right)\right]
\]

match a set of observed bond prices B(0, T).

Although we stress the application of (3.63) to interest rate modeling, it
should be noted that this process has other financial applications. For example,
Heston [179] proposed a stochastic volatility model in which the price of an
asset S(t) is governed by

\[
\frac{dS(t)}{S(t)} = \mu\,dt + \sqrt{V(t)}\,dW_1(t) \tag{3.64}
\]
\[
dV(t) = \alpha(b - V(t))\,dt + \sigma\sqrt{V(t)}\,dW_2(t), \tag{3.65}
\]

where $(W_1, W_2)$ is a two-dimensional Brownian motion. Thus, in Heston's
model, the squared volatility V(t) follows a square-root diffusion. In addition,
the process in (3.63) is sometimes used to model a stochastic intensity for a
jump process in, for example, modeling default.

A simple Euler discretization of (3.62) suggests simulating r(t) at times
$t_1,\ldots,t_n$ by setting

\[
r(t_{i+1}) = r(t_i) + \alpha(b - r(t_i))[t_{i+1} - t_i] + \sigma\sqrt{r(t_i)^+}\,\sqrt{t_{i+1} - t_i}\;Z_{i+1}, \tag{3.66}
\]

with $Z_1,\ldots,Z_n$ independent N(0,1) random variables. Notice that we have
taken the positive part of $r(t_i)$ inside the square root; some modification of
this form is necessary because the values of $r(t_i)$ produced by Euler
discretization may become negative. We will see, however, that this issue can be
avoided (along with any other discretization error) by sampling from the exact
transition law of the process.
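
The Euler scheme (3.66) can be sketched in a few lines. This is a minimal illustration using only the standard library; the function name and the convention that the grid starts at time 0 are assumptions of the sketch.

```python
import math
import random

def cir_euler_path(r0, alpha, b, sigma, times, rng):
    """Euler scheme (3.66) on the grid 0 < times[0] < times[1] < ...

    The positive part max(r, 0) is taken inside the square root only;
    the simulated values themselves may still become negative.
    """
    r, t_prev = r0, 0.0
    path = [r0]
    for t in times:
        dt = t - t_prev
        z = rng.gauss(0.0, 1.0)
        r = r + alpha * (b - r) * dt + sigma * math.sqrt(max(r, 0.0)) * math.sqrt(dt) * z
        path.append(r)
        t_prev = t
    return path
```

For example, `cir_euler_path(0.04, 0.2, 0.05, 0.1, [0.25, 0.5, 0.75, 1.0], random.Random(0))` returns the initial value followed by four simulated values.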


3.4.1 Transition Density 


The SDE (3.62) is not explicitly solvable the way those considered in Sec- 
tions 3.2 and 3.3 are; nevertheless, the transition density for the process is 
known. Based on results of Feller [118], Cox et al. [91] noted that the distri- 
bution of r(t) given r(u) for some u < t is, up to a scale factor, a noncentral 
chi-square distribution. This property can be used to simulate the process
(3.62). We follow the approach suggested by Scott [324].




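A noncentral chi-square random variable $\chi'^2_\nu(\lambda)$ with ν degrees of freedom
and noncentrality parameter λ has distribution

\[
P(\chi'^2_\nu(\lambda) \le y) = F_{\chi'^2_\nu(\lambda)}(y)
 = e^{-\lambda/2}\sum_{j=0}^{\infty} \frac{(\lambda/2)^j / j!}{2^{(\nu/2)+j}\,\Gamma\!\left(\frac{\nu}{2}+j\right)}
 \int_0^y z^{(\nu/2)+j-1} e^{-z/2}\,dz, \tag{3.67}
\]

for y > 0. The transition law of r(t) in (3.62) can be expressed as

\[
r(t) = \frac{\sigma^2\left(1 - e^{-\alpha(t-u)}\right)}{4\alpha}\;
 \chi'^2_d\!\left(\frac{4\alpha e^{-\alpha(t-u)}}{\sigma^2\left(1 - e^{-\alpha(t-u)}\right)}\,r(u)\right), \quad t > u, \tag{3.68}
\]

where

\[
d = \frac{4b\alpha}{\sigma^2}. \tag{3.69}
\]

This says that, given r(u), r(t) is distributed as $\sigma^2(1 - e^{-\alpha(t-u)})/(4\alpha)$ times
a noncentral chi-square random variable with d degrees of freedom and
noncentrality parameter

\[
\lambda = \frac{4\alpha e^{-\alpha(t-u)}}{\sigma^2\left(1 - e^{-\alpha(t-u)}\right)}\,r(u); \tag{3.70}
\]

equivalently,

\[
P(r(t) \le y \mid r(u)) = F_{\chi'^2_d(\lambda)}\!\left(\frac{4\alpha y}{\sigma^2\left(1 - e^{-\alpha(t-u)}\right)}\right),
\]

with d as in (3.69), λ as in (3.70), and $F_{\chi'^2_d(\lambda)}$ as in (3.67). Thus, we can
simulate the process (3.62) exactly on a discrete time grid provided we can
sample from the noncentral chi-square distribution.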

Like the Vasicek model, the square-root diffusion (3.62) has a limiting
stationary distribution. If we let t → ∞ in (3.68), we find that r(t) converges
in distribution to $\sigma^2/(4\alpha)$ times a noncentral chi-square random variable with
d degrees of freedom and noncentrality parameter 0 (making it an ordinary
chi-square random variable). This is a stationary distribution in the sense that
if r(0) is drawn from this distribution, then r(t) has the same distribution for
all t.


Chi-Square and Noncentral Chi-Square 


If ν is a positive integer and $Z_1,\ldots,Z_\nu$ are independent N(0,1) random
variables, then the distribution of

\[
Z_1^2 + Z_2^2 + \cdots + Z_\nu^2
\]

is called the chi-square distribution with ν degrees of freedom. The symbol $\chi^2_\nu$
denotes a random variable with this distribution; the prime in $\chi'^2_\nu(\lambda)$
emphasizes that this symbol refers to the noncentral case. The chi-square
distribution is given by

\[
P(\chi^2_\nu \le y) = \frac{1}{2^{\nu/2}\,\Gamma(\nu/2)}\int_0^y e^{-z/2} z^{(\nu/2)-1}\,dz, \tag{3.71}
\]

where Γ(·) denotes the gamma function and Γ(n) = (n − 1)! if n is a positive
integer. This expression defines a valid probability distribution for all ν > 0
and thus extends the definition of $\chi^2_\nu$ to non-integer ν.

For integer ν and constants $a_1,\ldots,a_\nu$, the distribution of

\[
\sum_{i=1}^{\nu} (Z_i + a_i)^2 \tag{3.72}
\]

is noncentral chi-square with ν degrees of freedom and noncentrality
parameter $\lambda = \sum_{i=1}^{\nu} a_i^2$. This representation explains the term “noncentral.” The
distribution in (3.67) extends the definition to non-integer ν.

It follows from the representation in (3.72) that if ν > 1 is an integer, then

\[
\chi'^2_\nu(\lambda) = \chi'^2_1(\lambda) + \chi^2_{\nu-1},
\]

meaning that the two sides have the same distribution when the random
variables on the right are independent of each other. As discussed in Johnson
et al. [202, p.436], this representation is valid even for non-integer ν > 1. Thus,
to generate $\chi'^2_\nu(\lambda)$, ν > 1, it suffices to generate $\chi^2_{\nu-1}$ and an independent
N(0,1) random variable Z and to set

\[
\chi'^2_\nu(\lambda) = (Z + \sqrt{\lambda})^2 + \chi^2_{\nu-1}. \tag{3.73}
\]

This reduces sampling of a noncentral chi-square to sampling of an ordinary
chi-square (and an independent normal) when ν > 1.

For any ν > 0, (3.67) indicates that a noncentral chi-square random variable
can be represented as an ordinary chi-square random variable with a
random degrees-of-freedom parameter. In more detail, if N is a Poisson random
variable with mean λ/2, then

\[
P(N = j) = e^{-\lambda/2}\,\frac{(\lambda/2)^j}{j!}, \quad j = 0, 1, 2, \ldots.
\]

Consider now a random variable $\chi^2_{\nu+2N}$ with N having this Poisson
distribution. Conditional on N = j, the random variable has an ordinary chi-square
distribution with ν + 2j degrees of freedom:

\[
P(\chi^2_{\nu+2N} \le y \mid N = j) = \frac{1}{2^{(\nu/2)+j}\,\Gamma((\nu/2)+j)}\int_0^y e^{-z/2} z^{(\nu/2)+j-1}\,dz.
\]

The unconditional distribution is thus given by

\[
\sum_{j=0}^{\infty} P(N = j)\,P(\chi^2_{\nu+2N} \le y \mid N = j)
 = \sum_{j=0}^{\infty} e^{-\lambda/2}\,\frac{(\lambda/2)^j}{j!}\,P(\chi^2_{\nu+2j} \le y),
\]

which is precisely the noncentral chi-square distribution in (3.67). We may
therefore sample $\chi'^2_\nu(\lambda)$ by first generating a Poisson random variable N and
then, conditional on N, sampling a chi-square random variable with ν + 2N
degrees of freedom. This reduces sampling of a noncentral chi-square to
sampling of an ordinary chi-square and a Poisson random variable. We discuss
methods for sampling from these distributions below. Figure 3.5 summarizes
their use in simulating the square-root diffusion (3.62).


Simulation of dr(t) = α(b − r(t)) dt + σ√r(t) dW(t)
on time grid 0 = t_0 < t_1 < ··· < t_n with d = 4bα/σ²

Case 1: d > 1
for i = 0,...,n − 1
    c ← σ²(1 − e^{−α(t_{i+1} − t_i)})/(4α)
    λ ← r(t_i)(e^{−α(t_{i+1} − t_i)})/c
    generate Z ~ N(0,1)
    generate X ~ χ²_{d−1}
    r(t_{i+1}) ← c[(Z + √λ)² + X]
end

Case 2: d ≤ 1
for i = 0,...,n − 1
    c ← σ²(1 − e^{−α(t_{i+1} − t_i)})/(4α)
    λ ← r(t_i)(e^{−α(t_{i+1} − t_i)})/c
    generate N ~ Poisson(λ/2)
    generate X ~ χ²_{d+2N}
    r(t_{i+1}) ← cX
end

Fig. 3.5. Simulation of square-root diffusion (3.62) by sampling from the transition
density.
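
The two cases in Figure 3.5 can be sketched as follows. This is an illustrative implementation under stated assumptions rather than a tuned one: chi-square variates are drawn with the standard library's `random.gammavariate` (shape ν/2, scale 2) instead of the specialized gamma generators of Section 3.4.2, and the Poisson variate is drawn by a simple sequential search, which is adequate only for moderate means. The function names are hypothetical.

```python
import math
import random

def poisson_by_search(theta, rng):
    """Sequential-search Poisson sampler; exp(-theta) underflows for very large theta."""
    p = math.exp(-theta)
    cdf, n, u = p, 0, rng.random()
    while u > cdf:
        n += 1
        p *= theta / n
        cdf += p
    return n

def chi_square(nu, rng):
    """Chi-square with nu > 0 degrees of freedom as a gamma(nu/2, scale 2) variate."""
    return rng.gammavariate(nu / 2.0, 2.0)

def cir_exact_step(r, alpha, b, sigma, dt, rng):
    """One draw from the transition law of (3.62) over a step of length dt."""
    d = 4.0 * b * alpha / sigma ** 2
    e1 = math.exp(-alpha * dt)
    c = sigma ** 2 * (1.0 - e1) / (4.0 * alpha)
    lam = r * e1 / c
    if d > 1.0:
        # Case 1: (Z + sqrt(lambda))^2 plus an independent chi-square with d-1 d.o.f.
        z = rng.gauss(0.0, 1.0)
        return c * ((z + math.sqrt(lam)) ** 2 + chi_square(d - 1.0, rng))
    # Case 2: chi-square with a Poisson-randomized degrees-of-freedom parameter.
    n = poisson_by_search(lam / 2.0, rng)
    return c * chi_square(d + 2.0 * n, rng)

def cir_exact_path(r0, alpha, b, sigma, times, rng):
    """Exact simulation on the grid 0 < times[0] < times[1] < ..."""
    r, t_prev, path = r0, 0.0, [r0]
    for t in times:
        r = cir_exact_step(r, alpha, b, sigma, t - t_prev, rng)
        path.append(r)
        t_prev = t
    return path
```

A simple check on either branch is that the sample mean of r(t) given r(0) should be close to e^{-αt} r(0) + b(1 - e^{-αt}).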


Figure 3.6 compares the exact distribution of r(t) with the distribution
produced by the Euler discretization (3.66) after a single time step. The
comparison is based on α = 0.2, σ = 0.1, b = 5%, and r(0) = 4%; the left panel
takes t = 0.25 and the right panel takes t = 1. These values for the model
parameters are sensible for an interest rate model if time is measured in years,
so the values of t should be interpreted as a quarter of a year and a full year,
respectively. The figures suggest that the Euler discretization produces too
many values close to or below 0 and a mode to the right of the true mode.
The effect is particularly pronounced over the rather large time step t = 1.

Fig. 3.6. Comparison of exact distribution (solid) and one-step Euler approximation
(dashed) for a square-root diffusion with α = 0.2, σ = 0.1, b = 5%, and r(0) = 4%.
The left panel compares distributions at t = 0.25, the right panel at t = 1.


3.4.2 Sampling Gamma and Poisson 


The discussion leading to Figure 3.5 reduces the problem of simulating the 
square-root diffusion (3.62) to one of sampling from a chi-square distribution 
and possibly also the normal and Poisson distributions. We discussed sampling 
from the normal distribution in Section 2.3; we now consider methods for 
sampling from the chi-square and Poisson distributions. 


Gamma Distribution 


The gamma distribution with shape parameter a and scale parameter β has
density

\[
f(y) = f_{a,\beta}(y) = \frac{1}{\Gamma(a)\beta^a}\,y^{a-1}e^{-y/\beta}, \quad y \ge 0. \tag{3.74}
\]

It has mean aβ and variance aβ². Comparison with (3.71) reveals that the
chi-square distribution is the special case of scale parameter β = 2 and shape
parameter a = ν/2. We therefore consider the more general problem of
generating samples from gamma distributions.

Methods for sampling from the gamma distribution typically distinguish
the cases a ≤ 1 and a > 1. For the application to the square-root diffusion
(3.62), the shape parameter a is given by d/2 with d as in (3.69). At least in
the case of an interest rate model, d would typically be larger than 2 so the
case a > 1 is most relevant. We include the case a ≤ 1 for completeness and
other potential applications. There is no loss of generality in fixing the scale
parameter β at 1: if X has the gamma distribution with parameters (a, 1),
then βX has the gamma distribution with parameters (a, β).

Cheng and Feast [83] develop a method based on a general approach to
random variate generation known as the ratio-of-uniforms method. The
ratio-of-uniforms method is closely related to the acceptance-rejection method
discussed in Section 2.2.2. It exploits the following property. Suppose f is a
nonnegative, integrable function on [0, ∞); if (X, Y) is uniformly distributed
over the set $A = \{(x,y) : x \le \sqrt{f(y/x)}\}$, then the density of Y/X is
proportional to f. (See p.180 of Fishman [121] or p.59 of Gentle [136].) Suppose A
is contained in a bounded rectangle. Then to sample uniformly from A, we
can repeatedly sample pairs (X, Y) uniformly over the rectangle and keep the
first one that satisfies $X \le \sqrt{f(Y/X)}$. The ratio-of-uniforms method delivers
Y/X as a sample from the density proportional to f.
To sample from the gamma density with a > 1, define

\[
A = \left\{(x,y) : 0 \le x \le \sqrt{(y/x)^{a-1}e^{-y/x}}\right\}.
\]

This set is contained in the rectangle $[0,\bar x] \times [0,\bar y]$ with $\bar x = [(a-1)/e]^{(a-1)/2}$
and $\bar y = [(a+1)/e]^{(a+1)/2}$. Sampling uniformly over this rectangle, the
expected number of samples needed until one lands in A is given by the ratio of
the area of the rectangle to that of A. As shown in Fishman [121], this ratio
is $O(\sqrt{a})$, so the time required to generate a sample using this method grows
with the shape parameter. Cheng and Feast [83] and Fishman [121] develop 
modifications of this basic approach that accelerate sampling. In Figure 3.7, 
which is Fishman’s Algorithm GKM1, the first acceptance test is a fast check 
that reduces the number of logarithmic evaluations. When many samples are 
to be generated using the same shape parameter (as would be the case in the 
application to the square-root diffusion), the constants in the setup step in
Figure 3.7 should be computed just once and then passed as arguments to
the sampling routine. For large values of the shape parameter a, Algorithm 
GKM2 in Fishman [121] is faster than the method in Figure 3.7. 


Setup: ā ← a − 1, b ← (a − (1/(6a)))/ā, m ← 2/ā, d ← m + 2
repeat
    generate U1, U2 ~ Unif[0,1]
    V ← bU2/U1
    if mU1 − d + V + (1/V) ≤ 0, accept
    elseif m log U1 − log V + V − 1 ≤ 0, accept
until accept
return Z ← āV

Fig. 3.7. Algorithm GKM1 from Fishman [121], based on Cheng and Feast [83], for
sampling from the gamma distribution with parameters (a, 1), a > 1.
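
A direct transcription of the steps in Figure 3.7 might look as follows. This is a sketch, not Fishman's own code: the function name is hypothetical, and the uniforms are drawn as 1 - random() so that they lie in (0, 1] and the division and logarithm are always defined.

```python
import math
import random

def gamma_gkm1(a, rng):
    """Sample from the gamma distribution with parameters (a, 1), a > 1 (GKM1 steps)."""
    if a <= 1.0:
        raise ValueError("GKM1 requires shape parameter a > 1")
    # Setup constants; in repeated use these would be computed once per shape value.
    abar = a - 1.0
    b = (a - 1.0 / (6.0 * a)) / abar
    m = 2.0 / abar
    d = m + 2.0
    while True:
        u1 = 1.0 - rng.random()   # in (0, 1]
        u2 = 1.0 - rng.random()   # in (0, 1]
        v = b * u2 / u1
        # Fast acceptance test that avoids logarithms.
        if m * u1 - d + v + 1.0 / v <= 0.0:
            return abar * v
        # Full acceptance test.
        if m * math.log(u1) - math.log(v) + v - 1.0 <= 0.0:
            return abar * v
```

Multiplying the result by 2 with a = ν/2 gives a chi-square variate with ν degrees of freedom, as noted in the comparison of (3.74) with (3.71).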


Ahrens and Dieter [6] provide a fast acceptance-rejection algorithm for
the case a ≤ 1. Their method generates candidates by sampling from
distributions concentrated on [0,1] and (1,∞) with appropriate probabilities. In
more detail, let p = e/(a + e) (e = exp(1)) and define

\[
g(z) = \begin{cases} p\,a\,z^{a-1}, & 0 \le z \le 1 \\ (1-p)\,e^{-z+1}, & z > 1. \end{cases}
\]

This is a probability density; it is a mixture of the densities $a z^{a-1}$ on [0,1] and
$e^{-z+1}$ on (1,∞), with weights p and (1 − p), respectively. We can sample from
g by sampling from each of these densities with the corresponding
probabilities. Each of these two densities is easily sampled using the inverse
transform method: for the density $a z^{a-1}$ on [0,1] we can use $U^{1/a}$, U ~ Unif[0,1];
for the density $e^{-z+1}$ on (1,∞) we can use 1 − log(U). Samples from g are suitable
candidates for acceptance-rejection because the ratio $f_{a,1}(z)/g(z)$ with $f_{a,1}$ a
gamma density as in (3.74) is bounded. Inspection of this ratio indicates that
a candidate Z in [0,1] is accepted with probability $e^{-Z}$ and a candidate in
(1,∞) is accepted with probability $Z^{a-1}$. A global bound on the ratio is given
by

\[
f_{a,1}(z)/g(z) \le \frac{a+e}{a\,e\,\Gamma(a)} \le 1.39;
\]

recall from Section 2.2.2 that the upper bound on this ratio determines the
expected number of candidates generated per accepted sample.

Figure 3.8 displays the method of Ahrens and Dieter [6]. The figure is based
on Algorithm GS* in Fishman [121] but it makes the acceptance tests more
explicit, if perhaps slightly slower. Notice that if the condition Y ≤ 1 fails to
hold, then Y is uniformly distributed over [1, b]; this means that (b − Y)/a
has the distribution of U/e, U ~ Unif[0,1], and thus −log((b − Y)/a) has the
distribution of 1 − log(U).


Setup: b ← (a + e)/e
repeat
    generate U1, U2 ~ Unif[0,1]; Y ← bU1
    if Y ≤ 1
        Z ← Y^{1/a}
        if U2 ≤ exp(−Z), accept
    otherwise
        Z ← −log((b − Y)/a)
        if U2 ≤ Z^{a−1}, accept
until accept
return Z

Fig. 3.8. Ahrens-Dieter method for sampling from the gamma distribution with
parameters (a,1), a ≤ 1.
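
The steps in Figure 3.8 translate almost line for line into code. The following is a sketch with a hypothetical function name, using only the standard library; `random.random()` returns values in [0, 1), so Y stays strictly below b and the logarithm in the second branch is always defined.

```python
import math
import random

def gamma_ahrens_dieter(a, rng):
    """Sample from the gamma distribution with parameters (a, 1), 0 < a <= 1."""
    if not 0.0 < a <= 1.0:
        raise ValueError("this method requires 0 < a <= 1")
    b = (a + math.e) / math.e          # setup constant
    while True:
        u1, u2 = rng.random(), rng.random()
        y = b * u1
        if y <= 1.0:
            # Candidate from the density a z^(a-1) on [0, 1].
            z = y ** (1.0 / a)
            if u2 <= math.exp(-z):
                return z
        else:
            # Candidate from the density exp(-z + 1) on (1, infinity).
            z = -math.log((b - y) / a)
            if u2 <= z ** (a - 1.0):
                return z
```

In the application of Figure 3.5, this routine would cover chi-square variates with at most 2 degrees of freedom (shape ν/2 ≤ 1), after scaling the output by 2.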


Poisson Distribution 


The Poisson distribution with mean θ > 0 is given by

\[
P(N = k) = e^{-\theta}\,\frac{\theta^k}{k!}, \quad k = 0, 1, 2, \ldots. \tag{3.75}
\]

We abbreviate this by writing N ~ Poisson(θ). This is the distribution of
the number of events in [0,1] when the times between consecutive events are
independent and exponentially distributed with mean 1/θ. Thus, a simple
method for generating Poisson samples is to generate exponential random
variables $X_i = -\log(U_i)/\theta$ from independent uniforms $U_i$ and then take N to
be the largest integer for which $X_1 + \cdots + X_N \le 1$. This method is rather slow,
especially if θ is large. In the intended application in Figure 3.5, the mean of
the Poisson random variable (equal to half the noncentrality parameter in
the transition density of the square-root diffusion) could be quite large for
plausible parameter values.

An alternative is to use the inverse transform method. For discrete
distributions, this amounts to a sequential search for the smallest n at which
F(n) ≥ U, where F denotes the cumulative distribution function and U
is Unif[0,1]. In the case of a Poisson distribution, F(n) is calculated as
P(N = 0) + ··· + P(N = n); rather than calculate each term in this sum
using (3.75), we can use the relation P(N = k + 1) = P(N = k)θ/(k + 1).
Figure 3.9 illustrates the method.


p ← exp(−θ), F ← p
N ← 0
generate U ~ Unif[0,1]
while U > F
    N ← N + 1
    p ← pθ/N
    F ← F + p
return N

Fig. 3.9. Inverse transform method for sampling from Poisson(θ), the Poisson
distribution with mean θ.
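
The sequential search in Figure 3.9 can be written as a function of the uniform variate, which makes the mapping from U to N easy to inspect. The function names below are hypothetical; the sketch computes exp(-θ) directly, so it is unsuitable for very large θ, where that quantity underflows.

```python
import math
import random

def poisson_from_uniform(theta, u):
    """Smallest n with F(n) >= u, where F is the Poisson(theta) distribution function."""
    p = math.exp(-theta)      # P(N = 0)
    cdf = p
    n = 0
    while u > cdf:
        n += 1
        p *= theta / n        # P(N = n) = P(N = n-1) * theta / n
        cdf += p
    return n

def poisson_inverse_transform(theta, rng):
    """Draw a Poisson(theta) variate by the inverse transform method."""
    return poisson_from_uniform(theta, rng.random())
```

For instance, with θ = 1 the value F(0) = e^{-1} ≈ 0.368 and F(1) ≈ 0.736, so any u between these two values is mapped to N = 1.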


3.4.3 Bond Prices 


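Cox, Ingersoll, and Ross [91] derived an expression for the price of a bond

\[
B(t,T) = E\left[\exp\left(-\int_t^T r(u)\,du\right) \Big|\; r(t)\right]
\]

when the short rate evolves according to (3.62). The bond price has the
exponential affine form

\[
B(t,T) = e^{-A(t,T)\,r(t) + C(t,T)}
\]

as in a Gaussian short rate model, but with

\[
A(t,T) = \frac{2\left(e^{\gamma(T-t)} - 1\right)}{(\gamma+\alpha)\left(e^{\gamma(T-t)} - 1\right) + 2\gamma}
\]

and

\[
C(t,T) = \frac{2\alpha b}{\sigma^2}\,\log\left(\frac{2\gamma\,e^{(\alpha+\gamma)(T-t)/2}}{(\gamma+\alpha)\left(e^{\gamma(T-t)} - 1\right) + 2\gamma}\right),
\]

and $\gamma = \sqrt{\alpha^2 + 2\sigma^2}$.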
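
The closed-form CIR bond price, B(t, T) = exp(-A(t, T) r(t) + C(t, T)) with γ = sqrt(α² + 2σ²), is straightforward to evaluate. The sketch below uses a hypothetical function name and takes the time to maturity τ = T - t as its argument; it is meant for checking simulation output rather than as production code.

```python
import math

def cir_bond_price(r, alpha, b, sigma, tau):
    """CIR zero-coupon bond price for short rate r and time to maturity tau >= 0."""
    gamma = math.sqrt(alpha ** 2 + 2.0 * sigma ** 2)
    growth = math.expm1(gamma * tau)                 # e^{gamma tau} - 1
    denom = (gamma + alpha) * growth + 2.0 * gamma
    a_coef = 2.0 * growth / denom                    # A(t, T)
    c_coef = (2.0 * alpha * b / sigma ** 2) * math.log(
        2.0 * gamma * math.exp((alpha + gamma) * tau / 2.0) / denom
    )                                                # C(t, T)
    return math.exp(-a_coef * r + c_coef)
```

Two sanity checks are easy to reason about: the price equals 1 at τ = 0, and as σ shrinks toward zero the price approaches the discount factor implied by the deterministic path r(u) = b + (r(0) - b)e^{-αu}.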
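This expression for the bond price is a special case of a more general
result, given as Proposition 6.2.5 in Lamberton and Lapeyre [218]. This result
gives the bivariate Laplace transform of the short rate and its integral: for
nonnegative λ, θ,

\[
E\left[\exp\left(-\lambda r(T) - \theta\int_t^T r(u)\,du\right) \Big|\; r(t)\right]
 = \exp\left(-\alpha b\,\psi_1(T-t) - r(t)\,\psi_2(T-t)\right) \tag{3.76}
\]

with

\[
\psi_1(s) = -\frac{2}{\sigma^2}\,\log\left(\frac{2\gamma(\theta)\,e^{(\alpha+\gamma(\theta))s/2}}
 {\sigma^2\lambda\left(e^{\gamma(\theta)s} - 1\right) + \gamma(\theta) - \alpha + e^{\gamma(\theta)s}(\gamma(\theta)+\alpha)}\right)
\]

and

\[
\psi_2(s) = \frac{\lambda\left(\gamma(\theta) + \alpha + e^{\gamma(\theta)s}(\gamma(\theta) - \alpha)\right) + 2\theta\left(e^{\gamma(\theta)s} - 1\right)}
 {\sigma^2\lambda\left(e^{\gamma(\theta)s} - 1\right) + \gamma(\theta) - \alpha + e^{\gamma(\theta)s}(\gamma(\theta)+\alpha)}
\]

and $\gamma(\theta) = \sqrt{\alpha^2 + 2\sigma^2\theta}$. The bond pricing formula is the special case λ = 0,
θ = 1.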

The bivariate Laplace transform in (3.76) characterizes the joint
distribution of the short rate and its integral. This makes it possible, at least
in principle, to sample from the joint distribution of $(r(t_{i+1}), Y(t_{i+1}))$ given
$(r(t_i), Y(t_i))$ with

\[
Y(t) = \int_0^t r(u)\,du.
\]

As explained in Section 3.3.2, this would allow exact simulation of the short
rate and the discount factor on a discrete time grid. In the Gaussian setting,
the joint distribution of r(t) and Y(t) is normal and therefore easy to sample;
in contrast, the joint distribution determined by (3.76) is not explicitly
available. Scott [324] derives the Laplace transform of the conditional distribution
of $Y(t_{i+1}) - Y(t_i)$ given $r(t_i)$ and $r(t_{i+1})$, and explains how to use numerical
transform inversion to sample from the conditional distribution. Through this
method, he is able to simulate $(r(t_i), Y(t_i))$ without discretization error.

Time-Dependent Coefficients


As noted earlier, the parameter b is often replaced with a deterministic
function of time b(t) in order to calibrate the model to an initial term structure,
resulting in the dynamics specified in (3.63). In this more general setting, a
result of the form in (3.76) continues to hold but with functions $\psi_1$ and $\psi_2$
depending on both t and T rather than merely on T − t. Moreover, these functions
will not in general be available in closed form, but are instead characterized by
a system of ordinary differential equations. By solving these differential
equations numerically, it then becomes possible to compute bond prices. Indeed,
bond prices continue to have the exponential affine form, though the functions
A(t, T) and C(t, T) in the exponent are no longer available explicitly but are
also determined through ordinary differential equations (see Duffie, Pan,
Singleton [105] and Jamshidian [195]). This makes it possible to use a numerical
procedure to choose the function b(t) to match an initial set of bond prices
B(0, T).

Once the constant b is replaced with a function of time, the transition
density of the short rate process ceases to admit the relatively tractable form
discussed in Section 3.4.1. One can of course simulate using an Euler scheme
of the form

\[
r(t_{i+1}) = r(t_i) + \alpha[b(t_i) - r(t_i)][t_{i+1} - t_i] + \sigma\sqrt{r(t_i)^+}\,\sqrt{t_{i+1} - t_i}\;Z_{i+1},
\]

with independent $Z_i \sim N(0,1)$. However, it seems preferable (at least from a
distributional perspective) to replace this normal approximation to the
transition law with a noncentral chi-square approximation. For example, if we
let

\[
\bar b = \frac{1}{t_{i+1} - t_i}\int_{t_i}^{t_{i+1}} b(s)\,ds
\]

denote the average level of b(t) over $[t_i, t_{i+1}]$ (assumed positive), then (3.68)
suggests simulating by setting

\[
r(t_{i+1}) = \frac{\sigma^2\left(1 - e^{-\alpha(t_{i+1}-t_i)}\right)}{4\alpha}\;
 \chi'^2_d\!\left(\frac{4\alpha e^{-\alpha(t_{i+1}-t_i)}}{\sigma^2\left(1 - e^{-\alpha(t_{i+1}-t_i)}\right)}\,r(t_i)\right) \tag{3.77}
\]

with $d = 4\bar b\alpha/\sigma^2$. We can sample from the indicated noncentral chi-square
distribution using the methods discussed in Section 3.4.1. However, it must be
stressed that whereas (3.68) is an exact representation in the case of constant
coefficients, (3.77) is only an approximate procedure. If it suffices to choose
the function b(t) to match only bonds maturing at the simulation grid dates
$t_1,\ldots,t_n$, then it may be possible to choose b to be constant over each interval
$(t_i, t_{i+1}]$, in which case (3.77) becomes exact.

Jamshidian [195] shows that if α, b, and σ are all deterministic functions
of time, the transition density of r(t) can be represented through a noncentral
chi-square distribution provided $\alpha(t)b(t)/\sigma^2(t)$ is independent of t. From
(3.69) we see that this is equivalent to requiring that the degrees-of-freedom
parameter $d = 4b\alpha/\sigma^2$ be constant. However, in this setting, the other
parameters of the transition density are not given explicitly but rather as solutions
to ordinary differential equations.


Change of Numeraire 


Recall from Sections 1.2.3 and 3.3.2 that the forward measure for any date $T_F$
is the measure associated with taking as numeraire asset the bond $B(t,T_F)$
maturing at $T_F$. We saw in Section 3.3.2 that if the short rate follows an
Ornstein-Uhlenbeck process under the risk-neutral measure, then it continues
to follow an OU process under a forward measure. An analogous property
holds if the short rate follows a square-root diffusion.

Most of the development leading to (3.58) results from the exponential
affine formula for bond prices and thus extends to the square-root model. In
this setting, the bond price dynamics become

\[
\frac{dB(t,T)}{B(t,T)} = r(t)\,dt - A(t,T)\,\sigma\sqrt{r(t)}\,dW(t);
\]

in particular, the coefficient $\sigma\sqrt{r(t)}$ replaces the σ of the Gaussian case.
Proceeding as in (3.56)-(3.58) but with this substitution, we observe that
Girsanov's Theorem implies that the process $W^{T_F}$ defined by

\[
dW^{T_F}(t) = dW(t) + \sigma\sqrt{r(t)}\,A(t,T_F)\,dt
\]

is a standard Brownian motion under the forward measure $P_{T_F}$. The dynamics
of the short rate thus become

\[
\begin{aligned}
dr(t) &= \alpha(b(t) - r(t))\,dt + \sigma\sqrt{r(t)}\,dW(t) \\
 &= \alpha(b(t) - r(t))\,dt + \sigma\sqrt{r(t)}\left[dW^{T_F}(t) - \sigma\sqrt{r(t)}\,A(t,T_F)\,dt\right] \\
 &= \left[\alpha b(t) - \left(\alpha + \sigma^2 A(t,T_F)\right) r(t)\right] dt + \sigma\sqrt{r(t)}\,dW^{T_F}(t).
\end{aligned}
\]

This can be written as

\[
dr(t) = \left(\alpha + \sigma^2 A(t,T_F)\right)\left(\frac{\alpha\,b(t)}{\alpha + \sigma^2 A(t,T_F)} - r(t)\right) dt + \sigma\sqrt{r(t)}\,dW^{T_F}(t),
\]

which shows that under the forward measure the short rate is again a
square-root diffusion but one in which both the level to which the process reverts
and the speed with which it reverts are functions of time. The pricing of derivative
securities through simulation in the forward measure works the same way here
as in the Vasicek model.


3.4.4 Extensions 


In this section we consider further properties and extensions of the square-root
diffusion. We discuss multifactor models, a connection with squared Gaussian
processes, and a connection with CEV (constant elasticity of variance) models.

Multifactor Models

The simplest multifactor extension of the CIR interest rate model defines
independent processes

\[
dX_i(t) = \alpha_i(b_i - X_i(t))\,dt + \sigma_i\sqrt{X_i(t)}\,dW_i(t), \quad i = 1,\ldots,d,
\]

and takes the short rate to be $r(t) = X_1(t) + \cdots + X_d(t)$. Much as in the
discussion of Section 3.3.3, this extension allows imperfect instantaneous
correlation among bonds of different maturities. Each $X_i$ can be simulated using
the method developed in the previous sections for a single-factor model.

It is possible to consider more general models in which the underlying
processes $X_1,\ldots,X_d$ are correlated. However, once one goes beyond the case
of independent square-root factors, it seems more natural to move directly to 
the full generality of the affine models characterized by Duffie and Kan [101]. 
This class of models has a fair amount of tractability and computationally 
attractive features, but we will not consider it further here. 


Squared Gaussian Models 


We next point out a connection between the (single-factor) square-root diffu- 
sion and a Gaussian model of the type considered in Section 3.3. This connec- 
tion is of intrinsic interest, it sheds further light on the simulation procedure 
of Section 3.4.1, and it suggests a wider class of interest rate models. The 
link between the CIR model and squared Gaussian models is noted in Rogers 
[307]; related connections are developed in depth by Revuz and Yor [306] in 
their discussion of Bessel processes. 

Let $X_1(t), \dots, X_d(t)$ be independent Ornstein-Uhlenbeck processes of the
form

$$dX_i(t) = -\frac{\alpha}{2} X_i(t)\,dt + \frac{\sigma}{2}\,dW_i(t), \qquad i = 1, \dots, d,$$

for some constants $\alpha$, $\sigma$, and independent Brownian motions $W_1, \dots, W_d$. Let
$Y(t) = X_1^2(t) + \cdots + X_d^2(t)$; then Itô's formula gives


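$$\begin{aligned}
dY(t) &= \sum_{i=1}^d \left(2X_i(t)\,dX_i(t) + \frac{\sigma^2}{4}\,dt\right) \\
&= \sum_{i=1}^d \left(-\alpha X_i^2(t) + \frac{\sigma^2}{4}\right)dt + \sigma \sum_{i=1}^d X_i(t)\,dW_i(t) \\
&= \alpha\left(\frac{\sigma^2 d}{4\alpha} - Y(t)\right)dt + \sigma \sum_{i=1}^d X_i(t)\,dW_i(t).
\end{aligned}$$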
If we now define

$$d\tilde W(t) = \sum_{i=1}^d \frac{X_i(t)}{\sqrt{Y(t)}}\,dW_i(t),$$
then $\tilde W(t)$ is a standard Brownian motion because the vector $(X_1(t), \dots,
X_d(t))/\sqrt{Y(t)}$ multiplying $(dW_1(t), \dots, dW_d(t))'$ has norm 1 for all $t$. Hence,

$$dY(t) = \alpha\left(\frac{\sigma^2 d}{4\alpha} - Y(t)\right)dt + \sigma\sqrt{Y(t)}\,d\tilde W(t),$$

which has the form of (3.62) with $b = \sigma^2 d/(4\alpha)$.

Starting from (3.62) and reversing these steps, we find that we can
construct a square-root diffusion as a sum of squared independent
Ornstein-Uhlenbeck processes provided $d = 4b\alpha/\sigma^2$ is an integer. Observe that this is
precisely the degrees-of-freedom parameter in (3.69). In short, a square-root
diffusion with an integer degrees-of-freedom parameter is a sum of squared 
Gaussian processes. 

We can use this construction from Gaussian processes to simulate $r(t)$ in
(3.62) if $d$ is an integer. Writing $r(t_{i+1})$ as $\sum_{j=1}^d X_j^2(t_{i+1})$ and using (3.45)
for the one-step evolution of the $X_j$, we arrive at

$$r(t_{i+1}) = \sum_{j=1}^d \left( e^{-\frac12\alpha(t_{i+1}-t_i)} \sqrt{r(t_i)/d} + \frac{\sigma}{2}\sqrt{\frac{1}{\alpha}\left(1 - e^{-\alpha(t_{i+1}-t_i)}\right)}\; Z_{i+1}^{(j)} \right)^2,$$

where $(Z_i^{(1)}, \dots, Z_i^{(d)})$ are standard normal $d$-vectors, independent for different
values of $i$. Comparison with (3.72) reveals that the expression on the right is a
scalar multiple of a noncentral chi-square random variable, so this construction 
is really just a special case of the method in Section 3.4.1. It sheds some light 
on the appearance of the noncentral chi-square distribution in the law of r(t). 
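The sum-of-squares construction can be sketched in a few lines. The following is a minimal illustration, not the book's own code: it assumes an integer $d$, uses only Python's standard library, and the function name `cir_step_squared_ou` and its argument order are hypothetical. Each of the $d$ Ornstein-Uhlenbeck components is restarted at $\sqrt{r(t_i)/d}$, which is permissible because the law of the sum of squares depends on the starting vector only through its squared norm.

```python
import math
import random

def cir_step_squared_ou(r, dt, alpha, sigma, d, rng):
    """One step of the square-root diffusion with b = sigma^2 * d / (4 * alpha),
    built as a sum of d squared Ornstein-Uhlenbeck variables (d must be an integer)."""
    decay = math.exp(-0.5 * alpha * dt)            # OU mean-reversion factor
    scale = 0.5 * sigma * math.sqrt((1.0 - math.exp(-alpha * dt)) / alpha)
    start = math.sqrt(r / d)                       # common starting value of each X_j
    return sum((decay * start + scale * rng.gauss(0.0, 1.0)) ** 2
               for _ in range(d))

# Usage: simulate a short path on an equally spaced grid (illustrative parameters).
rng_demo = random.Random(0)
path = [0.04]
for _ in range(5):
    path.append(cir_step_squared_ou(path[-1], 0.25, 0.2, 0.1, 4, rng_demo))
```

Because every term is a square, the simulated values are nonnegative by construction; the cost grows linearly in $d$, which is one reason the chi-square method of Section 3.4.1 is usually preferable.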

This construction also points to another strategy for constructing inter- 
est rate models: rather than restricting ourselves to a sum of independent, 
identical squared OU processes, we can consider other quadratic functions 
of multivariate Gaussian processes. This idea has been developed in Beagle- 
hole and Tenney [42] and Jamshidian [196]. The resulting models are closely 
related to the affine family. 


CEV Process 


We conclude this section with a digression away from interest rate models
to consider a class of asset price processes closely related to the square-root
diffusion.

Among the important alternatives to the lognormal model for an asset
price considered in Section 3.2 is the constant elasticity of variance (CEV)
process (see Cox and Ross [89], Schroder [322] and references there)

$$dS(t) = \mu S(t)\,dt + \sigma S(t)^{\beta/2}\,dW(t). \tag{3.78}$$


This includes geometric Brownian motion as the special case $\beta = 2$; some
empirical studies have found that $\beta < 2$ gives a better fit to stock price data.
If we write the model as

$$\frac{dS(t)}{S(t)} = \mu\,dt + \sigma S(t)^{(\beta-2)/2}\,dW(t),$$

we see that the instantaneous volatility $\sigma S(t)^{(\beta-2)/2}$ depends on the current
level of the asset, and $\beta < 2$ implies a negative relation between the price level
and volatility.

If we set $X(t) = S(t)^{2-\beta}$ and apply Itô's formula, we find that

$$dX(t) = \left[\frac{\sigma^2}{2}(2-\beta)(1-\beta) + \mu(2-\beta)X(t)\right]dt + \sigma(2-\beta)\sqrt{X(t)}\,dW(t),$$

revealing that $X(t)$ is a square-root diffusion. For $\mu > 0$ and $1 < \beta < 2$, we can
use the method of Section 3.4.1 to simulate $X(t)$ on a discrete time grid
and then invert the transformation from $S$ to $X$ to get $S(t) = X(t)^{1/(2-\beta)}$.
The case $\beta < 1$ presents special complications because of the behavior of $S$
near 0; simulation of this case is investigated in Andersen and Andreasen [13]. 


3.5 Processes with Jumps 


Although the vast majority of models used in derivatives pricing assume that 
the underlying assets have continuous sample paths, many studies have found 
evidence of the importance of jumps in prices and have advocated the inclu- 
sion of jumps in pricing models. Compared with a normal distribution, the 
logarithm of a price process with jumps is often leptokurtotic, meaning that it 
has a high peak and heavy tails, features typical of market data. In this sec- 
tion we discuss a few relatively simple models with jumps, highlighting issues 
that affect the implementation of Monte Carlo. 


3.5.1 A Jump-Diffusion Model 


Merton [263] introduced and analyzed one of the first models with both jump 
and diffusion terms for the pricing of derivative securities. Merton applied this 
model to options on stocks and interpreted the jumps as idiosyncratic shocks 
affecting an individual company but not the market as a whole. Similar models 
have subsequently been applied to indices, exchange rates, commodity prices,
and interest rates.

Merton's jump-diffusion model can be specified through the SDE

$$\frac{dS(t)}{S(t-)} = \mu\,dt + \sigma\,dW(t) + dJ(t) \tag{3.79}$$

where $\mu$ and $\sigma$ are constants, $W$ is a standard one-dimensional Brownian
motion, and $J$ is a process independent of $W$ with piecewise constant sample
paths. In particular, $J$ is given by

$$J(t) = \sum_{j=1}^{N(t)} (Y_j - 1) \tag{3.80}$$
where $Y_1, Y_2, \dots$ are random variables and $N(t)$ is a counting process. This
means that there are random arrival times

$$0 < \tau_1 < \tau_2 < \cdots$$

and

$$N(t) = \sup\{n : \tau_n \le t\}$$

counts the number of arrivals in $[0, t]$. The symbol $dJ(t)$ in (3.79) stands for
the jump in $J$ at time $t$. The size of this jump is $Y_j - 1$ if $t = \tau_j$ and 0 if $t$
does not coincide with any of the $\tau_j$.

In the presence of jumps, a symbol like S(t) is potentially ambiguous: if 
it is possible for S to jump at t, we need to specify whether S(t) means the 
value of S just before or just after the jump. We follow the usual convention 
of assuming that our processes are continuous from the right, so 

$$S(t) = \lim_{u \downarrow t} S(u)$$

includes the effect of any jump at $t$. To specify the value just before a potential
jump we write $S(t-)$, which is the limit

$$S(t-) = \lim_{u \uparrow t} S(u)$$

from the left.
If we write (3.79) as

$$dS(t) = \mu S(t-)\,dt + \sigma S(t-)\,dW(t) + S(t-)\,dJ(t),$$

we see that the increment $dS(t)$ in $S$ at $t$ depends on the value of $S$ just before
a potential jump at $t$ and not on the value just after the jump. This is as it
should be. The jump in $S$ at time $t$ is $S(t) - S(t-)$. This is 0 unless $J$ jumps
at $t$, which is to say unless $t = \tau_j$ for some $j$. The jump in $S$ at $\tau_j$ is

$$S(\tau_j) - S(\tau_j-) = S(\tau_j-)\,[J(\tau_j) - J(\tau_j-)] = S(\tau_j-)\,(Y_j - 1),$$

hence

$$S(\tau_j) = S(\tau_j-)\,Y_j.$$

This reveals that the $Y_j$ are the ratios of the asset price just after a jump to
the price just before it; that is, the jumps are multiplicative. This also explains
why we wrote $Y_j - 1$ rather than simply $Y_j$ in (3.80).

By restricting the $Y_j$ to be positive random variables, we ensure that $S(t)$
can never become negative. In this case, we see that

$$\log S(\tau_j) = \log S(\tau_j-) + \log Y_j,$$


so the jumps are additive in the logarithm of the price. Additive jumps are a 
natural extension of Brownian motion and multiplicative jumps (as in (3.79)) 
provide a more natural extension of geometric Brownian motion; see the dis- 
cussion at the beginning of Section 3.2. The solution of (3.79) is given by 


$$S(t) = S(0)\,e^{(\mu - \frac12\sigma^2)t + \sigma W(t)} \prod_{j=1}^{N(t)} Y_j, \tag{3.81}$$

which evidently generalizes the corresponding solution for geometric Brownian
motion.

Thus far, we have not imposed any distributional assumptions on the jump
process $J(t)$. We now consider the simplest model, the one studied by
Merton [263], which takes $N(t)$ to be a Poisson process with rate $\lambda$. This
makes the interarrival times $\tau_{j+1} - \tau_j$ independent with a common exponential
distribution,

$$P(\tau_{j+1} - \tau_j \le t) = 1 - e^{-\lambda t}, \qquad t \ge 0.$$

We further assume that the $Y_j$ are i.i.d. and independent of $N$ (as well as $W$).
Under these assumptions, J is called a compound Poisson process. 

As noted by Merton [263], the model is particularly tractable when the $Y_j$
are lognormally distributed, because a product of lognormal random variables
is itself lognormal. In more detail, if $Y_j \sim LN(a, b^2)$ (so that $\log Y_j \sim N(a, b^2)$)
then for any fixed $n$,

$$\prod_{j=1}^n Y_j \sim LN(an, b^2 n).$$

It follows that, conditional on $N(t) = n$, $S(t)$ has the distribution of

$$\begin{aligned}
S(0)\,e^{(\mu - \frac12\sigma^2)t + \sigma W(t)} \prod_{j=1}^n Y_j &\sim S(0) \cdot LN\big((\mu - \tfrac12\sigma^2)t,\ \sigma^2 t\big) \cdot LN(an, b^2 n) \\
&= LN\big(\log S(0) + (\mu - \tfrac12\sigma^2)t + an,\ \sigma^2 t + b^2 n\big),
\end{aligned}$$


using the independence of the $Y_j$ and $W$. If we let $F_{n,t}$ denote this lognormal
distribution (cf. Section 3.2.1) and recall that $N(t)$ has a Poisson distribution
with mean $\lambda t$, then from the Poisson probabilities (3.75) we find that the
unconditional distribution of $S(t)$ is

$$P(S(t) \le x) = \sum_{n=0}^{\infty} e^{-\lambda t}\,\frac{(\lambda t)^n}{n!}\,F_{n,t}(x),$$


a Poisson mixture of lognormal distributions. Merton [263] used this property 
to express the price of an option on S as an infinite series, each term of which 
is the product of a Poisson probability and a Black-Scholes formula. 
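To make the mixture representation concrete, the distribution function can be evaluated by truncating the Poisson sum. This is a sketch under stated assumptions: the truncation level `n_terms` and the function names are hypothetical choices made for illustration, and the lognormal parameters are those derived above for jumps with $\log Y_j \sim N(a, b^2)$.

```python
import math

def normal_cdf(z):
    """Standard normal distribution function via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def merton_cdf(x, s0, mu, sigma, lam, a, b, t, n_terms=60):
    """P(S(t) <= x) as a truncated Poisson mixture of lognormal distributions."""
    total = 0.0
    weight = math.exp(-lam * t)              # Poisson probability of n = 0 jumps
    for n in range(n_terms):
        mean = math.log(s0) + (mu - 0.5 * sigma ** 2) * t + a * n
        std = math.sqrt(sigma ** 2 * t + b ** 2 * n)
        total += weight * normal_cdf((math.log(x) - mean) / std)
        weight *= lam * t / (n + 1)          # recursion for the next Poisson probability
    return total
```

With `lam = 0.0` only the first term survives and the function reduces to the lognormal distribution of geometric Brownian motion. The number of terms needed grows with $\lambda t$, since the Poisson weights must be negligible beyond the truncation point.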
Recall that in the absence of jumps the drift $\mu$ in (3.79) would be the
risk-free rate, assuming the asset pays no dividends and assuming the model
represents the dynamics under the risk-neutral measure. Suppose, for
simplicity, that the risk-free rate is a constant $r$; then the drift is determined by the
condition that $S(t)e^{-rt}$ be a martingale. Merton [263] extends this principle
to his jump-diffusion model under the assumption that jumps are specific to
a single stock and can be diversified away; that is, by assuming that the
market does not compensate investors for bearing the risk of jumps. We briefly
describe how this assumption determines the drift parameter $\mu$ in (3.79).

A standard property of the Poisson process is that $N(t) - \lambda t$ is a martingale.
A generalization of this property is that

$$\sum_{j=1}^{N(t)} h(Y_j) - \lambda E[h(Y)]\,t$$

is a martingale for i.i.d. $Y, Y_1, Y_2, \dots$ and any function $h$ for which $E[h(Y)]$ is finite.
Accordingly, the process

$$J(t) - \lambda m t$$

is a martingale if $m = E[Y_j] - 1$. The choice of drift parameter in (3.79) that
makes $S(t)e^{-rt}$ a martingale is therefore $\mu = r - \lambda m$. In this case, if we rewrite
(3.79) as

$$\frac{dS(t)}{S(t-)} = r\,dt + \sigma\,dW(t) + [dJ(t) - \lambda m\,dt],$$

the last two terms on the right are martingales and the net growth rate in
$S(t)$ is indeed $r$.

With this notation and with $\log Y_j \sim N(a, b^2)$, Merton's [263] option
pricing formula becomes

$$\begin{aligned}
e^{-rT} E[(S(T) - K)^+] &= \sum_{n=0}^{\infty} e^{-\lambda T}\,\frac{(\lambda T)^n}{n!}\; e^{-rT} E[(S(T) - K)^+ \mid N(T) = n] \\
&= \sum_{n=0}^{\infty} e^{-\lambda' T}\,\frac{(\lambda' T)^n}{n!}\; BS(S(0), \sigma_n, T, r_n, K),
\end{aligned}$$

where $\lambda' = \lambda(1 + m)$, $\sigma_n^2 = \sigma^2 + b^2 n/T$, $r_n = r - \lambda m + n\log(1 + m)/T$, and
$BS(\cdot)$ denotes the Black-Scholes call option formula (1.4).
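The series can be evaluated numerically by truncation, which gives a useful benchmark for checking a simulation. The sketch below is illustrative only: the function names, the argument order of `black_scholes_call` (chosen to mirror $BS(S(0), \sigma, T, r, K)$), and the truncation level `n_terms` are assumptions, and $m = E[Y_j] - 1 = e^{a + b^2/2} - 1$ follows from the lognormal assumption on the jumps.

```python
import math

def normal_cdf(z):
    """Standard normal distribution function via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def black_scholes_call(s0, sigma, T, r, K):
    """Black-Scholes price of a European call on a non-dividend-paying asset."""
    d1 = (math.log(s0 / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * math.sqrt(T))
    d2 = d1 - sigma * math.sqrt(T)
    return s0 * normal_cdf(d1) - K * math.exp(-r * T) * normal_cdf(d2)

def merton_call(s0, sigma, T, r, K, lam, a, b, n_terms=60):
    """Truncated series for the call price with log Y_j ~ N(a, b^2)."""
    m = math.exp(a + 0.5 * b ** 2) - 1.0       # m = E[Y_j] - 1
    lam_prime = lam * (1.0 + m)
    price = 0.0
    weight = math.exp(-lam_prime * T)          # Poisson(lam' T) probability of n = 0
    for n in range(n_terms):
        sigma_n = math.sqrt(sigma ** 2 + b ** 2 * n / T)
        r_n = r - lam * m + n * math.log(1.0 + m) / T
        price += weight * black_scholes_call(s0, sigma_n, T, r_n, K)
        weight *= lam_prime * T / (n + 1)      # next Poisson probability
    return price
```

Setting `lam = 0.0` recovers the Black-Scholes price, which is a convenient sanity check on an implementation.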


Simulating at Fixed Dates 


We consider two approaches to simulating the jump-diffusion model (3.79), 
each of which is an instance of a more general strategy for simulating a broader 
class of jump-diffusion models. In the first method, we simulate the process 
at a fixed set of dates $0 = t_0 < t_1 < \cdots < t_n$ without explicitly distinguishing
the effects of the jump and diffusion terms. In the second method, we simulate
the jump times $\tau_1, \tau_2, \dots$ explicitly.

We continue to assume that $N$ is a Poisson process, that $Y_1, Y_2, \dots$ are
i.i.d., and that $N$, $W$, and $\{Y_1, Y_2, \dots\}$ are mutually independent. We do not
assume the $Y_j$ are lognormally distributed, though that will constitute an
interesting special case. 


To simulate $S(t)$ at times $t_1, \dots, t_n$, we generalize (3.81) to

$$S(t_{i+1}) = S(t_i)\,e^{(\mu - \frac12\sigma^2)(t_{i+1} - t_i) + \sigma[W(t_{i+1}) - W(t_i)]} \prod_{j=N(t_i)+1}^{N(t_{i+1})} Y_j,$$

with the usual convention that the product over $j$ is equal to 1 if $N(t_{i+1}) =
N(t_i)$. We can simulate directly from this representation or else set $X(t) =
\log S(t)$ and

$$X(t_{i+1}) = X(t_i) + (\mu - \tfrac12\sigma^2)(t_{i+1} - t_i) + \sigma[W(t_{i+1}) - W(t_i)] + \sum_{j=N(t_i)+1}^{N(t_{i+1})} \log Y_j; \tag{3.82}$$

this recursion replaces products with sums and is preferable, at least if
sampling $\log Y_j$ is no slower than sampling $Y_j$. We can exponentiate simulated
values of the $X(t_i)$ to produce samples of the $S(t_i)$.
A general method for simulating (3.82) from $t_i$ to $t_{i+1}$ consists of the
following steps:

1. generate $Z \sim N(0, 1)$
2. generate $N \sim \text{Poisson}(\lambda(t_{i+1} - t_i))$ (see Figure 3.9); if $N = 0$, set $M = 0$
and go to Step 4
3. generate $\log Y_1, \dots, \log Y_N$ from their common distribution and set $M =
\log Y_1 + \cdots + \log Y_N$
4. set

$$X(t_{i+1}) = X(t_i) + (\mu - \tfrac12\sigma^2)(t_{i+1} - t_i) + \sigma\sqrt{t_{i+1} - t_i}\,Z + M.$$

This method relies on two properties of the Poisson process: the increment
$N(t_{i+1}) - N(t_i)$ has a Poisson distribution with mean $\lambda(t_{i+1} - t_i)$, and it is
independent of increments of $N$ over $[0, t_i]$.
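Steps 1 to 4 can be sketched as follows. This is a minimal illustration rather than a definitive implementation: the names `sample_poisson`, `simulate_log_price`, and the callback `sample_log_jump` are hypothetical, and the Poisson sampler is a plain inverse transform search in the spirit of the method the text attributes to Figure 3.9, written here from scratch with the standard library.

```python
import math
import random

def sample_poisson(theta, rng):
    """Inverse transform sampling of Poisson(theta) by sequential search."""
    p = math.exp(-theta)         # mass at the origin
    cdf = p
    n = 0
    u = rng.random()
    while u > cdf and p > 0.0:   # p > 0 guards against floating-point stalls
        n += 1
        p *= theta / n
        cdf += p
    return n

def simulate_log_price(x0, times, mu, sigma, lam, sample_log_jump, rng):
    """Simulate X = log S at the increasing dates in `times`, starting from x0 at time 0."""
    x, t_prev = x0, 0.0
    path = [x0]
    for t in times:
        dt = t - t_prev
        z = rng.gauss(0.0, 1.0)                           # Step 1
        n = sample_poisson(lam * dt, rng)                 # Step 2
        m = sum(sample_log_jump(rng) for _ in range(n))   # Step 3 (M = 0 if n == 0)
        x += (mu - 0.5 * sigma ** 2) * dt + sigma * math.sqrt(dt) * z + m  # Step 4
        path.append(x)
        t_prev = t
    return path

# Usage with normally distributed log jumps (illustrative parameters).
demo_path = simulate_log_price(
    math.log(100.0), [0.25, 0.5, 0.75, 1.0], 0.05, 0.2, 1.0,
    lambda g: g.gauss(-0.1, 0.15), random.Random(42))
```

Passing the jump sampler as a function keeps the recursion independent of the jump distribution, in line with the text's remark that lognormal jumps are only a special case.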

Under further assumptions on the distribution of the $Y_j$, this method can
sometimes be simplified. If the $Y_j$ have the lognormal distribution $LN(a, b^2)$,
then $\log Y_j \sim N(a, b^2)$ and

$$\sum_{j=1}^n \log Y_j \sim N(an, b^2 n) = an + b\sqrt{n}\,N(0, 1).$$

In this case, we may therefore replace Step 3 with the following:

3'. generate $Z_2 \sim N(0, 1)$; set $M = aN + b\sqrt{N}\,Z_2$


If the $\log Y_j$ have a gamma distribution with shape parameter $a$ and scale
parameter $\beta$ (see (3.74)), then

$$\log Y_1 + \log Y_2 + \cdots + \log Y_n$$

has the gamma distribution with shape parameter $an$ and scale parameter
$\beta$. Consequently, in Step 3 above we may sample $M$ directly from a gamma
distribution, conditional on the value of N. 

Kou [215] proposes and analyzes a model in which $|\log Y_j|$ has a gamma
distribution (in fact exponential) and the sign of $\log Y_j$ is positive with
probability $q$, negative with probability $1 - q$. In this case, conditional on the Poisson
random variable $N$ taking the value $n$, the number of $\log Y_j$ with positive sign
has a binomial distribution with parameters $n$ and $q$. Step 3 can therefore be
replaced with the following:

3a''. generate $K \sim \text{Binomial}(N, q)$
3b''. generate $R_1 \sim \text{Gamma}(Ka, \beta)$ and $R_2 \sim \text{Gamma}((N - K)a, \beta)$ and set
$M = R_1 - R_2$


In 3b'', interpret a gamma random variable with shape parameter zero as
the constant 0 in case $K = 0$ or $K = N$. In 3a'', conditional on $N = n$, the
binomial distribution of $K$ is given by

$$P(K = k) = \frac{n!}{k!\,(n - k)!}\,q^k (1 - q)^{n-k}, \qquad k = 0, 1, \dots, n.$$

Samples from this distribution can be generated using essentially the same
method used for the Poisson distribution in Figure 3.9 by changing just the
first and sixth lines of that algorithm. In the first line, replace the mass at the
origin $\exp(-\theta)$ for the Poisson distribution with the corresponding value
$(1 - q)^n$ for the binomial distribution. Observe that the ratio $P(K = k)/P(K =
k - 1)$ is given by $q(n + 1 - k)/(k(1 - q))$, so the sixth line of the algorithm
becomes $p \leftarrow pq(n + 1 - N)/(N(1 - q))$ (where $N$ now refers to the binomial
random variable produced by the algorithm).
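Steps 3a'' and 3b'' can be sketched as below. The code is an illustration under stated assumptions: it requires $0 < q < 1$, the function names are hypothetical, and the binomial sampler is a standalone inverse transform search using the probability ratio given above rather than a transcription of Figure 3.9. Gamma variates come from `random.gammavariate(shape, scale)` in the standard library, which rejects a zero shape, so the zero-shape case is handled explicitly.

```python
import random

def sample_binomial(n, q, rng):
    """Inverse transform sampling of Binomial(n, q) for 0 < q < 1."""
    p = (1.0 - q) ** n           # mass at the origin
    cdf = p
    k = 0
    u = rng.random()
    while u > cdf and k < n:
        k += 1
        p *= q * (n + 1 - k) / (k * (1.0 - q))   # P(K = k) / P(K = k - 1)
        cdf += p
    return k

def kou_jump_sum(n, q, a, beta, rng):
    """Sum M of n log-jumps whose absolute values are Gamma(a, beta) distributed,
    each positive with probability q and negative with probability 1 - q."""
    k = sample_binomial(n, q, rng)                                     # Step 3a''
    # A gamma variable with shape zero is interpreted as the constant 0.
    r1 = rng.gammavariate(k * a, beta) if k > 0 else 0.0               # upward jumps
    r2 = rng.gammavariate((n - k) * a, beta) if n - k > 0 else 0.0     # downward jumps
    return r1 - r2                                                     # Step 3b''
```

With $a = 1$ the absolute log-jumps are exponential, which is the case the text associates with Kou [215].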


Simulating Jump Times 


Simulation methods based on (3.82) produce values $S(t_i) = \exp(X(t_i))$,
$i = 1, \dots, n$, with the exact joint distribution of the target process (3.79)
at dates $t_1, \dots, t_n$. Notice, however, that this approach does not identify the
times at which $S(t)$ jumps; rather, it generates the total number of jumps in
each interval $(t_i, t_{i+1}]$, using the fact that the number of jumps has a Poisson
distribution. 

An alternative approach to simulating (3.79) simulates the jump times
$\tau_1, \tau_2, \dots$ explicitly. From one jump time to the next, $S(t)$ evolves like an
ordinary geometric Brownian motion because we have assumed that $W$ and
$J$ in (3.79) are independent of each other. It follows that, conditional on the
times $\tau_1, \tau_2, \dots$ of the jumps,

$$S(\tau_{j+1}-) = S(\tau_j)\,e^{(\mu - \frac12\sigma^2)(\tau_{j+1} - \tau_j) + \sigma[W(\tau_{j+1}) - W(\tau_j)]}$$

and

$$S(\tau_{j+1}) = S(\tau_{j+1}-)\,Y_{j+1}.$$

Taking logarithms and combining these steps, we get

$$X(\tau_{j+1}) = X(\tau_j) + (\mu - \tfrac12\sigma^2)(\tau_{j+1} - \tau_j) + \sigma[W(\tau_{j+1}) - W(\tau_j)] + \log Y_{j+1}.$$


A general scheme for simulating one step of this recursion now takes the 
following form: 


1. generate $R_{j+1}$ from the exponential distribution with mean $1/\lambda$
2. generate $Z_{j+1} \sim N(0, 1)$
3. generate $\log Y_{j+1}$
4. set $\tau_{j+1} = \tau_j + R_{j+1}$ and

$$X(\tau_{j+1}) = X(\tau_j) + (\mu - \tfrac12\sigma^2)R_{j+1} + \sigma\sqrt{R_{j+1}}\,Z_{j+1} + \log Y_{j+1}.$$

Recall from Section 2.2.1 that the exponential random variable $R_{j+1}$ can be
generated by setting $R_{j+1} = -\log(U)/\lambda$ with $U \sim \text{Unif}[0,1]$.
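The recursion over jump times can be sketched as follows. Several choices here are assumptions made for illustration and are not part of the scheme above: the path is stopped at a fixed `horizon`, a final jump-free diffusion step carries the process from the last jump to the horizon, and $1 - U$ is used in place of $U$ in the inverse transform (both are uniform) so that the logarithm is never evaluated at zero. The function and callback names are hypothetical.

```python
import math
import random

def simulate_with_jump_times(x0, horizon, mu, sigma, lam, sample_log_jump, rng):
    """Simulate X = log S at its jump times in (0, horizon] and at the horizon.
    Returns a list of (time, value) pairs, starting with (0, x0)."""
    tau, x = 0.0, x0
    path = [(tau, x)]
    while True:
        r = -math.log(1.0 - rng.random()) / lam   # Step 1: exponential, mean 1/lam
        if tau + r > horizon:
            break                                  # next jump falls beyond the horizon
        z = rng.gauss(0.0, 1.0)                    # Step 2
        log_y = sample_log_jump(rng)               # Step 3
        tau += r                                   # Step 4
        x += (mu - 0.5 * sigma ** 2) * r + sigma * math.sqrt(r) * z + log_y
        path.append((tau, x))
    rem = horizon - tau                            # diffusion only, no jump, up to the horizon
    x += (mu - 0.5 * sigma ** 2) * rem + sigma * math.sqrt(rem) * rng.gauss(0.0, 1.0)
    path.append((horizon, x))
    return path

# Usage with normally distributed log jumps (illustrative parameters).
jump_path = simulate_with_jump_times(
    0.0, 1.0, 0.05, 0.2, 3.0, lambda g: g.gauss(-0.1, 0.15), random.Random(4))
```

Discarding the exponential draw that overshoots the horizon is harmless here because of the memoryless property of the exponential distribution.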

The two approaches to simulating S(t) can be combined. For example, 
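suppose we fix a date $t$ in advance that we would like to include among the
simulated dates. Suppose it happens that $\tau_j < t < \tau_{j+1}$ (i.e., $N(t) = N(t-) =
j$). Then

$$S(t) = S(\tau_j)\,e^{(\mu - \frac12\sigma^2)(t - \tau_j) + \sigma[W(t) - W(\tau_j)]}$$

and

$$S(\tau_{j+1}) = S(t)\,e^{(\mu - \frac12\sigma^2)(\tau_{j+1} - t) + \sigma[W(\tau_{j+1}) - W(t)]}\,Y_{j+1}.$$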


Both approaches to simulating the basic jump-diffusion process (3.79) — 
simulating the number of jumps in fixed subintervals and simulating the times 
at which jumps occur — can be useful at least as approximations in simulating 
more general jump-diffusion models. Exact simulation becomes difficult when 
the times of the jumps and the evolution of the process between jumps are no 
longer independent of each other. 


Inhomogeneous Poisson Process 


A simple extension of the jump-diffusion model (3.79) replaces the constant
jump intensity $\lambda$ of the Poisson process with a deterministic (nonnegative)
function of time $\lambda(t)$. This means that

$$P(N(t + h) - N(t) = 1 \mid N(t)) = \lambda(t)h + o(h)$$

and $N(t)$ is called an inhomogeneous Poisson process. Like an ordinary
Poisson process it has independent increments and these increments are Poisson
distributed, but increments over different intervals of equal length can have
different means. In particular, the number of jumps in an interval $(t_i, t_{i+1}]$
has a Poisson distribution with mean $\Lambda(t_{i+1}) - \Lambda(t_i)$, where

$$\Lambda(t) = \int_0^t \lambda(u)\,du.$$

Provided this function can be evaluated, simulation based on (3.82)
generalizes easily to the inhomogeneous case: where we previously sampled from the
Poisson distribution with mean $\lambda(t_{i+1} - t_i)$, we now sample from the Poisson
distribution with mean $\Lambda(t_{i+1}) - \Lambda(t_i)$.

It is also possible to simulate the interarrival times of the jumps. The key
property is

$$P(\tau_{j+1} - \tau_j \le t \mid \tau_1, \dots, \tau_j) = 1 - \exp\big(-[\Lambda(\tau_j + t) - \Lambda(\tau_j)]\big), \qquad t \ge 0,$$

provided $\Lambda(\infty) = \infty$. We can (at least in principle) sample from this
distribution using the inverse transform method discussed in Section 2.2.1. Given
$\tau_j$, let

$$X = \inf\left\{ t \ge 0 : 1 - \exp\left(-\int_{\tau_j}^{\tau_j + t} \lambda(u)\,du\right) = U \right\}, \qquad U \sim \text{Unif}[0,1];$$

then $X$ has the required interarrival time distribution and we may set $\tau_{j+1} =
\tau_j + X$. This is equivalent to setting

$$X = \inf\left\{ t \ge 0 : \int_{\tau_j}^{\tau_j + t} \lambda(u)\,du = \xi \right\} \tag{3.83}$$

where $\xi$ is exponentially distributed with mean 1. We may therefore
interpret the time between jumps as the time required to consume an exponential
random variable if it is consumed at rate $\lambda(u)$ at time $u$.

If the time-varying intensity $\lambda(t)$ is bounded by a constant $\bar\lambda$, the jumps of
the inhomogeneous Poisson process can be generated by thinning an ordinary
Poisson process $\bar N$ with rate $\bar\lambda$, as in Lewis and Shedler [235]. In this procedure,
the jump times of $\bar N$ become potential jump times of $N$; a potential jump at
time $t$ is accepted as an actual jump with probability $\lambda(t)/\bar\lambda$. A bit more
explicitly, we have the following steps:

1. generate jump times $\bar\tau_j$ of $\bar N$ (the interarrival times $\bar\tau_{j+1} - \bar\tau_j$ are
independent and exponentially distributed with mean $1/\bar\lambda$)
2. for each $j$ generate $U_j \sim \text{Unif}[0,1]$; if $U_j\bar\lambda < \lambda(\bar\tau_j)$ then accept $\bar\tau_j$ as a
jump time of $N$.


Figure 3.10 illustrates this construction. 
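The two thinning steps can be sketched in a few lines. This is a minimal illustration, assuming the intensity is supplied as a Python function bounded above by `lam_bar` on the interval of interest; the function name `thinned_jump_times` and the finite `horizon` are hypothetical choices. The standard library call `random.expovariate(rate)` returns an exponential variate with mean `1/rate`.

```python
import random

def thinned_jump_times(intensity, lam_bar, horizon, rng):
    """Jump times in (0, horizon] of an inhomogeneous Poisson process with
    intensity(t) <= lam_bar, obtained by thinning a rate-lam_bar Poisson process."""
    times = []
    t = 0.0
    while True:
        t += rng.expovariate(lam_bar)              # Step 1: next potential jump time
        if t > horizon:
            return times
        if rng.random() * lam_bar < intensity(t):  # Step 2: accept w.p. intensity(t)/lam_bar
            times.append(t)

# Usage: intensity 2t on [0, 1], bounded by lam_bar = 2 (illustrative).
demo_times = thinned_jump_times(lambda t: 2.0 * t, 2.0, 1.0, random.Random(8))
```

The efficiency of the method depends on how tight the bound is: the expected fraction of potential jumps accepted is the ratio of the area under the intensity curve to the area under the constant bound.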
Fig. 3.10. Construction of an inhomogeneous Poisson process from an ordinary
Poisson process by thinning. The horizontal coordinates of the open circles are the
jump times of a Poisson process with rate $\bar\lambda$; each circle is raised to a height uniformly
distributed between 0 and $\bar\lambda$. Circles below the curve $\lambda(t)$ are accepted as jumps of
the inhomogeneous Poisson process. The times of the accepted jumps are indicated
by the filled circles.


3.5.2 Pure-Jump Processes 


If S(t) is the jump-diffusion process in (3.79) with J(t) a compound Poisson 
process, then X(t) = log S(t) is a process with independent increments. This 
is evident from (3.82) and the fact that both W and J have independent 
increments. Geometric Brownian motion also has the property that its loga- 
rithm has independent increments. It is therefore natural to ask what other 
potentially fruitful models of asset prices might arise from the representation 


$$S(t) = S(0)\exp(X(t)) \tag{3.84}$$


with X having independent increments. Notice that we have adopted the 
normalization X (0) = 0. 

The process $X$ is a Lévy process if it has stationary, independent
increments and satisfies the technical requirement that $X(t)$ converges in
distribution to $X(s)$ as $t \to s$. Stationarity of the increments means that
X(t +s) — X(s) has the distribution of X(t). Every Lévy process can be 
represented as the sum of a deterministic drift, a Brownian motion, and a 
pure-jump process independent of the Brownian motion (see, e.g., Chapter 4 
of Sato [317]). If the number of jumps in every finite interval is almost surely 
finite, then the pure-jump component is a compound Poisson process. Hence, 
in constructing processes of the form (3.84) with X a Lévy process, the only 
way to move beyond the jump-diffusion process (3.79) is to consider processes 
with an infinite number of jumps in finite intervals. We will in fact focus on 
pure-jump processes of this type — that is, Lévy processes with no Brown- 
ian component. Several processes of this type have been proposed as models 
of asset prices, and we consider some of these examples. A more extensive 
discussion of the simulation of Lévy processes can be found in Asmussen [21].

It should be evident that in considering processes with an infinite number
of jumps in finite intervals, only the first of the two approaches developed in 
Section 3.5.1 is viable: we may be able to simulate the increments of such a 
process, but we cannot hope to simulate from one jump to the next. To sim- 
ulate a pure-jump Lévy process we should therefore consider the distribution 
of its increments over a fixed time grid. 

A random variable $Y$ (more precisely, its distribution) is said to be
infinitely divisible if for each $n = 2, 3, \dots$, there are i.i.d. random variables
$Y_1^{(n)}, \dots, Y_n^{(n)}$ such that $Y_1^{(n)} + \cdots + Y_n^{(n)}$ has the distribution of $Y$. If $X$ is
a Lévy process ($X(0) = 0$), then

$$X(t) = X(t/n) + [X(2t/n) - X(t/n)] + \cdots + [X(t) - X((n - 1)t/n)]$$

decomposes $X(t)$ as the sum of $n$ i.i.d. random variables and shows that $X(t)$
has an infinitely divisible distribution. Conversely, for each infinitely divisible
distribution there is a Lévy process for which $X(1)$ has that distribution.
Simulating a Lévy process on a fixed time grid is thus equivalent to sampling
from infinitely divisible distributions.

A Lévy process with nondecreasing sample paths is called a subordinator. 
A large class of Lévy processes (sometimes called processes of type G) can 
be represented as W (G(t)) with W Brownian motion and G a subordinator 
independent of W. Several of the examples we consider belong to this class. 


Gamma Processes 


If $Y_1, \dots, Y_n$ are independent with distribution Gamma$(a/n, \beta)$, then $Y_1 +
\cdots + Y_n$ has distribution Gamma$(a, \beta)$; thus, gamma distributions are infinitely
divisible. For each choice of the parameters $a$ and $\beta$ there is a Lévy process
(called a gamma process) such that $X(1)$ has distribution Gamma$(a, \beta)$. We
can simulate this process on a time grid $t_1, \dots, t_n$ by sampling the increments

$$X(t_{i+1}) - X(t_i) \sim \text{Gamma}(a \cdot (t_{i+1} - t_i), \beta)$$

independently, using the methods of Section 3.4.2.

A gamma random variable takes only positive values so a gamma process 
is nondecreasing. This makes it unsuitable as a model of (the logarithm of) a 
risky asset price. Madan and Seneta [243] propose a model based on (3.84) and 
X(t) = U(t)— D(t), with U and D independent gamma processes representing 
the up and down moves of X. They call this the variance gamma process. 
Increments of X can be simulated through the increments of U and D. 

If U(1) and D(1) have the same shape and scale parameters, then X admits 
an alternative representation as W(G(t)) where W is a standard Brownian 
motion and G is a gamma process. In other words, X can be viewed as the 
result of applying a random time-change to an ordinary Brownian motion: the 
deterministic time argument t has been replaced by the random time G(t), 
which becomes the conditional variance of $W(G(t))$ given $G(t)$. This explains
the name “variance gamma.”

Madan et al. [242] consider the more general case $W(G(t))$ where $W$ now
has drift parameter $\mu$ and variance parameter $\sigma^2$. They restrict the shape
parameter of $G(1)$ to be the reciprocal of its scale parameter $\beta$ (so that
$E[G(t)] = t$) and show that this more general variance gamma process can
still be represented as the difference $U(t) - D(t)$ of two independent gamma
processes. The shape and scale parameters of $U(1)$ and $D(1)$ should be chosen
to satisfy $a_U = a_D = 1/\beta$ and

$$\beta_U \beta_D = \frac{\sigma^2 \beta}{2}, \qquad \beta_U - \beta_D = \mu\beta.$$

The general variance gamma process can therefore still be simulated as 
the difference between two independent gamma processes. Alternatively, we 
can use the representation $X(t) = W(G(t))$ for simulation. Conditional on the
increment $G(t_{i+1}) - G(t_i)$, the increment $W(G(t_{i+1})) - W(G(t_i))$ has a normal
distribution with mean $\mu[G(t_{i+1}) - G(t_i)]$ and variance $\sigma^2[G(t_{i+1}) - G(t_i)]$.
Hence, we can simulate $X$ as follows:

1. generate $Y \sim \text{Gamma}((t_{i+1} - t_i)/\beta, \beta)$ (this is the increment in $G$)
2. generate $Z \sim N(0, 1)$
3. set $X(t_{i+1}) = X(t_i) + \mu Y + \sigma\sqrt{Y}\,Z$.


The relative merits of this method and simulation through the difference of U 
and D depend on the implementation details of the methods used for sampling 
from the gamma and normal distributions. 
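Both simulation routes for the variance gamma increment can be sketched side by side. The code is illustrative and makes some assumptions explicit: the function names are hypothetical, $\sigma > 0$ is required so that both gamma scales are positive, and the scales $\beta_U$, $\beta_D$ are obtained by solving $\beta_U\beta_D = \sigma^2\beta/2$ and $\beta_U - \beta_D = \mu\beta$ as a quadratic. Gamma variates come from `random.gammavariate(shape, scale)` in the standard library.

```python
import math
import random

def vg_increment_time_change(dt, mu, sigma, beta, rng):
    """Increment of X = W(G(t)) over a step dt via a gamma time change."""
    y = rng.gammavariate(dt / beta, beta)      # increment in G, with mean dt
    z = rng.gauss(0.0, 1.0)
    return mu * y + sigma * math.sqrt(y) * z

def vg_increment_difference(dt, mu, sigma, beta, rng):
    """The same increment as a difference of two independent gamma increments."""
    root = math.sqrt((mu * beta) ** 2 + 2.0 * sigma ** 2 * beta)
    beta_u = 0.5 * (root + mu * beta)          # scale of the up process U
    beta_d = 0.5 * (root - mu * beta)          # scale of the down process D
    shape = dt / beta                          # common shape a_U = a_D = 1/beta per unit time
    return rng.gammavariate(shape, beta_u) - rng.gammavariate(shape, beta_d)
```

Under either route the increment over a step `dt` has mean $\mu\,dt$ and variance $(\sigma^2 + \mu^2\beta)\,dt$, which gives a simple check that two implementations agree.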

Figure 3.11 compares two variance gamma densities with a normal density; 
all three have mean 0 and standard deviation 0.4. The figure illustrates the 
much higher kurtosis that can be achieved within the variance gamma family. 
Although the examples in the figure are symmetric, positive and negative 
skewness can be introduced through the parameter $\mu$.


Normal Inverse Gaussian Processes 


This class of processes, described in Barndorff-Nielsen [36], has some similari- 
ties to the variance gamma model. It is a Lévy process whose increments have 
a normal inverse Gaussian distribution; it can also be represented through a 
random time-change of Brownian motion. 

The inverse Gaussian distribution with parameters $\delta, \gamma > 0$ has density

$$f_{\delta,\gamma}(x) = \frac{\delta e^{\delta\gamma}}{\sqrt{2\pi}}\,x^{-3/2} \exp\left(-\tfrac12\left(\delta^2 x^{-1} + \gamma^2 x\right)\right), \qquad x > 0. \tag{3.85}$$

This is the density of the first passage time to level $\delta$ of a Brownian motion
with drift $\gamma$. It has mean $\delta/\gamma$ and variance $\delta/\gamma^3$. The inverse Gaussian
distribution is infinitely divisible: if $X_1$ and $X_2$ are independent and have this density
with parameters $(\delta_1, \gamma)$ and $(\delta_2, \gamma)$, then it is clear from the first passage time
interpretation that $X_1 + X_2$ has this density with parameters $(\delta_1 + \delta_2, \gamma)$. It
follows that there is a Lévy process $Y(t)$ for which $Y(1)$ has density (3.85).

Fig. 3.11. Examples of variance gamma densities. The most peaked curve has $\mu = 0$,
$\sigma = 0.4$, and $\beta = 1$ (and is in fact a double exponential density). The next most
peaked curve has $\mu = 0$, $\sigma = 0.4$, and $\beta = 0.5$. The dashed line is the normal density
with mean 0 and standard deviation 0.4.

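The normal inverse Gaussian distribution NIG$(\alpha, \beta, \mu, \delta)$ with parameters
$\alpha, \beta, \mu, \delta$ can be described as the distribution of

$$\mu + \beta Y(1) + \sqrt{Y(1)}\,Z, \qquad Z \sim N(0, 1), \tag{3.86}$$

with $Y(1)$ having density (3.85), $\alpha = \sqrt{\beta^2 + \gamma^2}$, and $Z$ independent of $Y(1)$.
The mean and variance of this distribution are

$$\mu + \frac{\delta\beta}{\alpha\sqrt{1 - (\beta/\alpha)^2}} \qquad \text{and} \qquad \frac{\delta}{\alpha\left(1 - (\beta/\alpha)^2\right)^{3/2}},$$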

respectively. The density is given in Barndorff-Nielsen [36] in terms of a mod- 
ified Bessel function. Three examples are graphed in Figure 3.12; these illus- 
trate the possibility of positive and negative skew and high kurtosis within 
this family of distributions. 

Independent normal inverse Gaussian random variables add in the
following way:

$$\text{NIG}(\alpha, \beta, \mu_1, \delta_1) + \text{NIG}(\alpha, \beta, \mu_2, \delta_2) = \text{NIG}(\alpha, \beta, \mu_1 + \mu_2, \delta_1 + \delta_2).$$

In particular, these distributions are infinitely divisible. Barndorff-Nielsen [36]
studies Lévy processes with NIG increments. Such a process $X(t)$ can be
represented as $W(Y(t))$ with $Y(t)$ the Lévy process defined from (3.85) and
$W$ a Brownian motion with drift $\beta$, unit variance, and initial value $W(0) = \mu$.
At t = 1, this representation reduces to (3.86). 
Fig. 3.12. Examples of normal inverse Gaussian densities. The parameters
$(\alpha, \beta, \mu, \delta)$ are as follows: (1, −0.75, 2, 1) for A, (1, 0, 0, 1) for B, and (1, 2, −0.75, 1)
for C. The dashed line is the standard normal density and is included for comparison
with case B, which also has mean 0 and standard deviation 1.


Eberlein [109] discusses the use of the NIG Lévy processes (in fact, a more 
general family called generalized hyperbolic Lévy processes) in modeling log 
returns. Barndorff-Nielsen [36] proposes several mechanisms for constructing 
models of price processes using NIG Lévy processes as a building block. 

As with the variance gamma process of Madan and Seneta [243], there 
are in principle two strategies for simulating X on a discrete time grid. We 
can simulate the increments by sampling from the NIG distribution directly 
or we can use the representation as a time-changed Brownian motion (as in 
(3.86)). However, direct sampling from the NIG distribution does not appear 
to be particularly convenient, so we consider only the second of these two 
alternatives. 

To simulate $X(t)$ as $W(Y(t))$ we need to be able to generate the
increments of $Y$ by sampling from the (ordinary) inverse Gaussian distribution. An
interesting method for doing this was developed by Michael, Schucany, and
Haas [264]. Their method uses the fact that if $Y$ has the density in (3.85),
then

$$\frac{(\gamma Y - \delta)^2}{Y} \sim \chi_1^2;$$

we may therefore sample $Y$ by first generating $V \sim \chi_1^2$. Given a value of $V$,
the resulting equation for $Y$ has two roots,

$$y_1 = \frac{\delta}{\gamma} + \frac{V}{2\gamma^2} - \frac{1}{2\gamma^2}\sqrt{4\delta\gamma V + V^2}$$

and

$$y_2 = \frac{\delta^2}{\gamma^2 y_1}.$$
Michael et al. [264] show that the smaller root $y_1$ should be chosen with
probability $\delta/(\delta + \gamma y_1)$ and the larger root $y_2$ with the complementary probability.
Figure 3.13 illustrates the implementation of the method. The $\chi_1^2$ random
variable required for this algorithm can be generated either as a Gamma(1/2, 2)
random variable or as the square of a standard normal.


Setup: a ← 1/γ, b ← a ∗ δ, b ← b ∗ b

generate V ∼ χ²₁
ξ ← a ∗ V
Y ← a ∗ (δ + (ξ/2) + √(ξ ∗ (δ + (ξ/4))))
p ← δ/(δ + γ ∗ Y)
generate U ∼ Unif[0,1]
if U > p then Y ← b/Y
return Y


Fig. 3.13. Algorithm for sampling from the inverse Gaussian distribution (3.85), 
based on Michael et al. [264]. 


To simulate an increment of the NIG process $X(t) = W(Y(t))$ from $t_i$ to
$t_{i+1}$, we use the algorithm in Figure 3.13 to generate a sample $Y$ from the
inverse Gaussian distribution with parameters $\delta(t_{i+1} - t_i)$ and $\gamma$; we then set

$$X(t_{i+1}) = X(t_i) + \beta Y + \sqrt{Y}\,Z$$

with $Z \sim N(0, 1)$. (Recall that $\beta$ is the drift of $W$ in the NIG parameterization.)
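The inverse Gaussian sampler and the resulting NIG increment can be sketched together. The first function follows the steps of Figure 3.13, starting from the larger root and switching to the smaller one when $U > p$; the second applies the time-change recipe just described. The function names are hypothetical, the $\chi_1^2$ variate is generated as the square of a standard normal, and the parameters are assumed to satisfy $\delta, \gamma > 0$.

```python
import math
import random

def sample_inverse_gaussian(delta, gamma, rng):
    """Sample from the inverse Gaussian density (3.85) with parameters delta, gamma > 0."""
    a = 1.0 / gamma
    b = (a * delta) ** 2                       # product of the two roots
    v = rng.gauss(0.0, 1.0) ** 2               # chi-square with one degree of freedom
    xi = a * v
    y = a * (delta + 0.5 * xi + math.sqrt(xi * (delta + 0.25 * xi)))  # larger root
    p = delta / (delta + gamma * y)            # probability of keeping the larger root
    if rng.random() > p:
        y = b / y                              # switch to the smaller root
    return y

def nig_increment(dt, beta, delta, gamma, rng):
    """Increment of the NIG process X(t) = W(Y(t)) over a time step dt."""
    y = sample_inverse_gaussian(delta * dt, gamma, rng)   # increment of the time change
    return beta * y + math.sqrt(y) * rng.gauss(0.0, 1.0)
```

A quick check of an implementation is to compare sample moments with the stated values: the inverse Gaussian draw has mean $\delta/\gamma$, so the NIG increment over `dt` has mean $\beta\delta\,dt/\gamma$.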

Despite the evident similarity between this construction and the one used 
for the variance gamma process, a result of Asmussen and Rosiński [25] points 
to an important distinction between the two processes: the cumulative effect of 
small jumps can be well-approximated by Brownian motion in a NIG process 
but not in a variance gamma process. Loosely speaking, even the small jumps 
of the variance gamma process are too large or too infrequent to look like 
Brownian motion. Asmussen and Rosiński [25] discuss the use and applicabil- 
ity of a Brownian approximation to small jumps in simulating Lévy processes. 


Stable Paretian Processes 


A distribution is called stable if for each $n \ge 2$ there are constants $a_n > 0$ and
$b_n$ such that

$$X_1 + X_2 + \cdots + X_n =_d a_n X + b_n,$$

where $X, X_1, \dots, X_n$ are independent random variables with that distribution.
(The symbol “$=_d$” indicates equality in distribution.) If $b_n = 0$ for all $n$, the
distribution is strictly stable. The best known example is the standard normal
distribution for which

$$X_1 + X_2 + \cdots + X_n =_d n^{1/2} X.$$

In fact, $a_n$ must be of the form $n^{1/\alpha}$ for some $0 < \alpha \le 2$ called the index
of the stable distribution. This is Theorem VI.1.1 of Feller [119]; for broader 
coverage of the topic see Samorodnitsky and Taqqu [316]. 

Stable random variables are infinitely divisible and thus define Lévy 
processes. Like the other examples in this section, these Lévy processes have 
no Brownian component (except in the case of Brownian motion itself) and 
are thus pure-jump processes. They can often be constructed by applying a 
random time change to an ordinary Brownian motion, the time change itself 
having stable increments. 

Only the normal distribution has stable index $\alpha = 2$. Non-normal stable
distributions (those with $\alpha < 2$) are often called stable Paretian. These are
heavy-tailed distributions: if $X$ has stable index $\alpha < 2$, then $E[|X|^p]$ is infinite
for $p \ge \alpha$. In particular, all stable Paretian distributions have infinite variance
and those with $\alpha \le 1$ have $E[|X|] = \infty$. Mandelbrot [246] proposed using
stable Paretian distributions to model the high peaks and heavy tails (relative to
the normal distribution) of market returns. Infinite variance suggests that the 
tails of these distributions may be too heavy for market data, but see Rachev 
and Mittnik [302] for a comprehensive account of applications in finance. 

Stable random variables have probability densities but these are rarely
available explicitly; stable distributions are usually described through their
characteristic functions. The density is known for the normal case $\alpha = 2$; the
Cauchy (or $t_1$) distribution, corresponding to $\alpha = 1$ and density

$$f(x) = \frac{1}{\pi}\,\frac{1}{1 + x^2}, \quad -\infty < x < \infty;$$

and the case $\alpha = 1/2$ with density

$$f(x) = \frac{1}{\sqrt{2\pi}}\,x^{-3/2} \exp(-1/(2x)), \quad x > 0.$$

This last example may be viewed as a limiting case of the inverse Gaussian
distribution with $\gamma = 0$. Through a first passage time interpretation (see Feller
[119], Example VI.2(f)), the Cauchy distribution may be viewed as a limiting
case of the NIG distribution with $\alpha = \beta = \mu = 0$. Both densities given above can be
generalized by introducing scale and location parameters (as in [316], p.10).
This follows from the simple observation that if $X$ has a stable distribution
then so does $\mu + \sigma X$, for any constants $\mu, \sigma$.

As noted in Example 2.1.2, samples from the Cauchy distribution can be
generated using the inverse transform method. If $Z \sim N(0,1)$ then $1/Z^2$ has
the stable density above with $\alpha = 1/2$, so this case is also straightforward.
Perhaps surprisingly, it is also fairly easy to sample from other stable
distributions even though their densities are unknown. An important tool in sampling
from stable distributions is the following representation: if $V$ is uniformly
distributed over $[-\pi/2, \pi/2]$ and $W$ is exponentially distributed with mean 1,
then

$$\frac{\sin(\alpha V)}{(\cos V)^{1/\alpha}} \left(\frac{\cos((1-\alpha)V)}{W}\right)^{(1-\alpha)/\alpha}$$

has a symmetric $\alpha$-stable distribution; see p.42 of Samorodnitsky and Taqqu
[316] for a proof. As noted there, this reduces to the Box-Muller method
(see Section 2.3.2) when $\alpha = 2$. Chambers, Mallows, and Stuck [79] develop
simulation procedures based on this representation and additional
transformations. Samorodnitsky and Taqqu [316], pp.46-49, provide computer code
for sampling from an arbitrary stable distribution, based on Chambers et al.
[79].
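
The representation above translates almost directly into code. The following Python sketch is an illustration only (it is not the code from [316] or [79], and the function names are invented here); it separates the deterministic formula from the random draws so the special cases $\alpha = 1$ and $\alpha = 2$ can be checked by hand.

```python
import math
import random


def symmetric_stable_from(v, w, alpha):
    """Evaluate the representation for given V in [-pi/2, pi/2] and W > 0."""
    first = math.sin(alpha * v) / math.cos(v) ** (1.0 / alpha)
    second = (math.cos((1.0 - alpha) * v) / w) ** ((1.0 - alpha) / alpha)
    return first * second


def sample_symmetric_stable(alpha, rng):
    """One draw from a symmetric alpha-stable distribution, 0 < alpha <= 2."""
    v = rng.uniform(-math.pi / 2.0, math.pi / 2.0)
    w = rng.expovariate(1.0)          # exponential with mean 1
    return symmetric_stable_from(v, w, alpha)
```

With $\alpha = 1$ the second factor is identically 1 and the expression reduces to $\tan V$, the inverse transform sample for the Cauchy distribution. With $\alpha = 2$ it reduces to $2\sin(V)\sqrt{W}$, which is $\sqrt{2}$ times a Box-Muller normal.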

Feller [119], p.336, notes that the Lévy process generated by a symmet- 
ric stable distribution can be constructed through a random time change of 
Brownian motion. This also follows from the observation in Samorodnitsky 
and Taqqu [316], p.21, that a symmetric stable random variable can be gener- 
ated as the product of a normal random variable and a positive stable random 
variable, a construction similar to (3.86). 


3.6 Forward Rate Models: Continuous Rates 


The distinguishing feature of the models considered in this section and the 
next is that they explicitly describe the evolution of the full term structure of 
interest rates. This contrasts with the approach in Sections 3.3 and 3.4 based 
on modeling the dynamics of just the short rate r(t). In a setting like the 
Vasicek model or the Cox-Ingersoll-Ross model, the current value of the short 
rate determines the current value of all other term structure quantities — 
forward rates, bond prices, etc. In these models, the state of the world is com- 
pletely summarized by the value of the short rate. In multifactor extensions, 
like those described in Section 3.3.3, the state of the world is summarized by 
the current values of a finite number (usually small) of underlying factors; 
from the values of these factors all term structure quantities are determined, 
at least in principle. 

In the framework developed by Heath, Jarrow, and Morton [174] (HJM),
the state of the world is described by the full term structure and not necessarily 
by a finite number of rates or factors. The key contribution of HJM lies in 
identifying the restriction imposed by the absence of arbitrage on the evolution 
of the term structure. 

At any point in time the term structure of interest rates can be described 
in various equivalent ways: through the prices or yields of zero-coupon
bonds or par bonds, through forward rates, and through swap rates, to name
just a few examples. The HJM framework models the evolution of the term 
structure through the dynamics of the forward rate curve. It could be argued 
that forward rates provide the most primitive description of the term structure 
(and thus the appropriate starting point for a model) because bond prices and 
yields reflect averages of forward rates across maturities, but it seems difficult 
to press this point too far. 

From the perspective of simulation, this section represents a departure 
from the previous topics of this chapter. Thus far, we have focused on models 
that can be simulated exactly, at least at a finite set of dates. In the gener- 
ality of the HJM setting, some discretization error is usually inevitable. HJM 
simulation might therefore be viewed more properly as a topic for Chapter 6; 
we include it here because of its importance and because of special simulation
issues it raises.


3.6.1 The HJM Framework 


The HJM framework describes the dynamics of the forward rate curve
$\{f(t,T),\ 0 \le t \le T \le T^*\}$ for some ultimate maturity $T^*$ (e.g., 20 or 30
years from today). Think of this as a curve in the maturity argument $T$ for
each value of the time argument $t$; the length of the curve shrinks as time
advances because $t \le T \le T^*$. Recall that the forward rate $f(t,T)$ represents the
instantaneous continuously compounded rate contracted at time $t$ for riskless
borrowing or lending at time $T \ge t$. This is made precise by the relation

$$B(t,T) = \exp\left(-\int_t^T f(t,u)\,du\right)$$

between bond prices and forward rates, which implies

$$f(t,T) = -\frac{\partial}{\partial T} \log B(t,T). \tag{3.87}$$


The short rate is r(t) = f(t,t). Figure 3.14 illustrates this notation and the 
evolution of the forward curve. 
In the HJM setting, the evolution of the forward curve is modeled through
an SDE of the form

$$df(t,T) = \mu(t,T)\,dt + \sigma(t,T)^\top dW(t). \tag{3.88}$$


In this equation and throughout, the differential df is with respect to time t 
and not maturity T. The process W is a standard d-dimensional Brownian 
motion; d is the number of factors, usually equal to 1, 2, or 3. Thus, while the 
forward rate curve is in principle an infinite-dimensional object, it is driven by 
a low-dimensional Brownian motion. The coefficients $\mu$ and $\sigma$ in (3.88) (scalar
and $\mathbb{R}^d$-valued, respectively) could be stochastic or could depend on current
and past levels of forward rates. We restrict attention to the case in which $\mu$
and $\sigma$ are deterministic functions of $t$, $T \ge t$, and the current forward curve
$\{f(t,u),\ t \le u \le T^*\}$. Subject to technical conditions, this makes the evolution
of the curve Markovian. We could make this more explicit by writing, e.g.,
$\sigma(f,t,T)$, but to lighten notation we omit the argument $f$. See Heath, Jarrow,
and Morton [174] for the precise conditions needed for (3.88).

Fig. 3.14. Evolution of forward curve. At time 0, the forward curve $f(0,\cdot)$ is defined
for maturities in $[0, T^*]$ and the short rate is $r(0) = f(0,0)$. At $t > 0$, the forward
curve $f(t,\cdot)$ is defined for maturities in $[t, T^*]$ and the short rate is $r(t) = f(t,t)$.

We interpret (3.88) as modeling the evolution of forward rates under the 
risk-neutral measure (meaning, more precisely, that W is a standard Brownian 
motion under that measure). We know that the absence of arbitrage imposes a 
condition on the risk-neutral dynamics of asset prices: the price of a (dividend- 
free) asset must be a martingale when divided by the numeraire 


$$\beta(t) = \exp\left(\int_0^t r(u)\,du\right).$$


Forward rates are not, however, asset prices, so it is not immediately clear 
what restriction the absence of arbitrage imposes on the dynamics in (3.88). 
To find this restriction we must start from the dynamics of asset prices, in 
particular bonds. Our account is informal; see Heath, Jarrow, and Morton 
[174] for a rigorous development. 
To make the discounted bond prices $B(t,T)/\beta(t)$ positive martingales, we
posit dynamics of the form

$$\frac{dB(t,T)}{B(t,T)} = r(t)\,dt + \nu(t,T)^\top dW(t), \quad 0 \le t \le T \le T^*. \tag{3.89}$$

The bond volatilities $\nu(t,T)$ may be functions of current bond prices
(equivalently of current forward rates since (3.87) makes a one-to-one correspondence
between the two). Through (3.87), the dynamics in (3.89) constrain the
evolution of forward rates. By Itô's formula,

$$d \log B(t,T) = \left[r(t) - \tfrac{1}{2}\nu(t,T)^\top \nu(t,T)\right] dt + \nu(t,T)^\top dW(t).$$


If we now differentiate with respect to T and then interchange the order of 
differentiation with respect to t and T, from (3.87) we get 


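$$df(t,T) = -\frac{\partial}{\partial T}\, d\log B(t,T)
= -\frac{\partial}{\partial T}\left[r(t) - \tfrac{1}{2}\nu(t,T)^\top \nu(t,T)\right] dt - \frac{\partial}{\partial T}\nu(t,T)^\top dW(t).$$

Comparing this with (3.88), we find that we must have

$$\sigma(t,T) = -\frac{\partial}{\partial T}\nu(t,T)$$

and

$$\mu(t,T) = -\frac{\partial}{\partial T}\left[r(t) - \tfrac{1}{2}\nu(t,T)^\top \nu(t,T)\right] = \left(\frac{\partial}{\partial T}\nu(t,T)\right)^{\!\top} \nu(t,T).$$

To eliminate $\nu(t,T)$ entirely, notice that

$$\nu(t,T) = -\int_t^T \sigma(t,u)\,du + \text{constant}.$$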


But because $B(t,T)$ becomes identically 1 as $t$ approaches $T$ (i.e., as the bond
matures), we must have $\nu(T,T) = 0$ and thus the constant in this equation is
0. We can therefore rewrite the expression for $\mu$ as

$$\mu(t,T) = \sigma(t,T)^\top \int_t^T \sigma(t,u)\,du; \tag{3.90}$$

this is the risk-neutral drift imposed by the absence of arbitrage. Substituting 
in (3.88), we get 


$$df(t,T) = \left(\sigma(t,T)^\top \int_t^T \sigma(t,u)\,du\right) dt + \sigma(t,T)^\top dW(t). \tag{3.91}$$


This equation characterizes the arbitrage-free dynamics of the forward curve 
under the risk-neutral measure; it is the centerpiece of the HJM framework. 

Using a subscript $j = 1,\ldots,d$ to indicate vector components, we can write
(3.91) as

$$df(t,T) = \sum_{j=1}^{d} \left(\sigma_j(t,T) \int_t^T \sigma_j(t,u)\,du\right) dt + \sum_{j=1}^{d} \sigma_j(t,T)\,dW_j(t). \tag{3.92}$$

This makes it evident that each factor contributes a term to the drift and that
the combined drift is the sum of the contributions of the individual factors. 

In (3.91), the drift is determined once $\sigma$ is specified. This contrasts with the
dynamics of the short rate models in Sections 3.3 and 3.4 where parameters
of the drift could be specified independently of the diffusion coefficient without
introducing arbitrage. Indeed, choosing parameters of the drift is essential in 
calibrating short rate models to an observed set of bond prices. In contrast, 
an HJM model is automatically calibrated to an initial set of bond prices 
B(0,T) if the initial forward curve f(0,T) is simply chosen consistent with 
these bond prices through (3.87). Put slightly differently, calibrating an HJM 
model to an observed set of bond prices is a matter of choosing an appropriate 
initial condition rather than choosing a parameter of the model dynamics. The 
effort in calibrating an HJM model lies in choosing $\sigma$ to match market prices
of interest rate derivatives in addition to matching bond prices. 

We illustrate the HJM framework with some simple examples. 


Example 3.6.1 Constant $\sigma$. Consider a single-factor ($d = 1$) model in which
$\sigma(t,T) \equiv \sigma$ for some constant $\sigma$. The interpretation of such a model is that
each increment $dW(t)$ moves all points on the forward curve $\{f(t,u),\ t \le u \le T^*\}$
by an equal amount $\sigma\,dW(t)$; the diffusion term thus introduces only
parallel shifts in the forward curve. But a model in which the forward curve
makes only parallel shifts admits arbitrage opportunities: one can construct
a costless portfolio of bonds that will have positive value under every parallel
shift. From (3.90) we find that an HJM model with constant $\sigma$ has drift

$$\mu(t,T) = \sigma \int_t^T \sigma\,du = \sigma^2 (T - t).$$

In particular, the drift will vary (slightly, because $\sigma^2$ is small) across
maturities, keeping the forward curve from making exactly parallel movements. This
small adjustment to the dynamics of the forward curve is just enough to keep
the model arbitrage-free. In this case, we can solve (3.91) to find

$$f(t,T) = f(0,T) + \int_0^t \sigma^2 (T - u)\,du + \sigma W(t)
= f(0,T) + \tfrac{1}{2}\sigma^2 \left[T^2 - (T - t)^2\right] + \sigma W(t).$$

In essentially any model, the identity $r(t) = f(t,t)$ implies

$$dr(t) = df(t,T)\Big|_{T=t} + \frac{\partial}{\partial T} f(t,T)\Big|_{T=t}\,dt.$$

In the case of constant $\sigma$, we can write this explicitly as

$$dr(t) = \sigma\,dW(t) + \left(\frac{\partial}{\partial T} f(0,T)\Big|_{T=t} + \sigma^2 t\right) dt.$$

Comparing this with (3.50), we find that an HJM model with constant $\sigma$
coincides with a Ho-Lee model with calibrated drift. □


Example 3.6.2 Exponential $\sigma$. Another convenient parameterization takes
$\sigma(t,T) = \sigma \exp(-\alpha(T - t))$ for some constants $\sigma, \alpha > 0$. In this case, the
diffusion term $\sigma(t,T)\,dW(t)$ moves forward rates for short maturities more
than forward rates for long maturities. The drift is given by

$$\mu(t,T) = \sigma^2 e^{-\alpha(T-t)} \int_t^T e^{-\alpha(u-t)}\,du = \frac{\sigma^2}{\alpha}\left(e^{-\alpha(T-t)} - e^{-2\alpha(T-t)}\right).$$

An argument similar to the one used in Example 3.6.1 shows that the short
rate in this case is described by the Vasicek model with time-varying drift
parameters.
This example and the one that precedes it may be misleading. It would be
incorrect to assume that the short rate process in an HJM setting will always
have a convenient description. Indeed, such examples are exceptional. □
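
The drift formulas in Examples 3.6.1 and 3.6.2 can be checked numerically against the general expression (3.90). The sketch below is illustrative only: the helper name `hjm_drift`, the midpoint quadrature rule, and the parameter values are assumptions made here and are not part of the HJM framework itself.

```python
import math


def hjm_drift(sigma, t, T, n=2000):
    """Approximate mu(t,T) = sigma(t,T) * int_t^T sigma(t,u) du (single factor)."""
    h = (T - t) / n
    # Midpoint rule for the integral of sigma(t, u) over [t, T]
    integral = sum(sigma(t, t + (i + 0.5) * h) for i in range(n)) * h
    return sigma(t, T) * integral


s, a = 0.01, 0.1   # illustrative volatility level and decay rate


def constant(t, T):
    return s


def exponential(t, T):
    return s * math.exp(-a * (T - t))


t, T = 1.0, 5.0
closed_constant = s * s * (T - t)
closed_exponential = (s * s / a) * (math.exp(-a * (T - t)) - math.exp(-2 * a * (T - t)))
print(hjm_drift(constant, t, T), closed_constant)
print(hjm_drift(exponential, t, T), closed_exponential)
```

For a volatility specification without a closed-form integral, the same quadrature gives the continuous-time drift directly, although in a simulation the discrete drift derived in Section 3.6.2 is the relevant quantity.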


Example 3.6.3 Proportional $\sigma$. It is tempting to consider a specification of
the form $\sigma(t,T) = \tilde\sigma(t,T) f(t,T)$ for some deterministic $\tilde\sigma$ depending only on
$t$ and $T$. This would make $\tilde\sigma(t,T)$ the volatility of the forward rate $f(t,T)$
and would suggest that the distribution of $f(t,T)$ is approximately lognormal.
However, Heath et al. [174] note that this choice of $\sigma$ is inadmissible: it
produces forward rates that grow to infinity in finite time with positive
probability. The difficulty, speaking loosely, is that if $\sigma$ is proportional to the level
of rates, then the drift is proportional to the rates squared. This violates the
linear growth condition ordinarily required for the existence and uniqueness of
solutions to SDEs (see Appendix B.2). Market conventions often presuppose
the existence of a (proportional) volatility for forward rates, so the failure
of this example could be viewed as a shortcoming of the HJM framework.
We will see in Section 3.7 that the difficulty can be avoided by working with
simple rather than continuously compounded forward rates. □


Forward Measure 


Although the HJM framework is usually applied under the risk-neutral
measure, only a minor modification is necessary to work in a forward measure. Fix
a maturity $T_F$ and recall that the forward measure associated with $T_F$
corresponds to taking the bond $B(t,T_F)$ as numeraire asset. The forward measure
$P_{T_F}$ can be defined relative to the risk-neutral measure $P_\beta$ through

$$\left(\frac{dP_{T_F}}{dP_\beta}\right)_t = \frac{B(t,T_F)\,\beta(0)}{\beta(t)\,B(0,T_F)}.$$

From the bond dynamics in (3.89), we find that this ratio is given by

$$\exp\left(-\frac{1}{2}\int_0^t \nu(u,T_F)^\top \nu(u,T_F)\,du + \int_0^t \nu(u,T_F)^\top dW(u)\right).$$

By the Girsanov Theorem, the process $W^{T_F}$ defined by

$$dW^{T_F}(t) = -\nu(t,T_F)^\top dt + dW(t)$$

is therefore a standard Brownian motion under $P_{T_F}$. Recalling that $\nu(t,T)$ is
the integral of $-\sigma(t,u)$ from $u = t$ to $u = T$, we find that the forward rate
dynamics (3.91) become

$$\begin{aligned}
df(t,T) &= -\sigma(t,T)^\top \nu(t,T)\,dt + \sigma(t,T)^\top \left[\nu(t,T_F)^\top dt + dW^{T_F}(t)\right] \\
&= -\sigma(t,T)^\top \left[\nu(t,T) - \nu(t,T_F)\right] dt + \sigma(t,T)^\top dW^{T_F}(t) \\
&= -\sigma(t,T)^\top \left(\int_T^{T_F} \sigma(t,u)\,du\right) dt + \sigma(t,T)^\top dW^{T_F}(t),
\end{aligned} \tag{3.93}$$

for $t \le T \le T_F$. Thus, the HJM dynamics under the forward measure are
similar to the dynamics under the risk-neutral measure, but where we
previously integrated $\sigma(t,u)$ from $t$ to $T$, we now integrate $-\sigma(t,u)$ from $T$ to $T_F$.
Notice that $f(t,T_F)$ is a martingale under $P_{T_F}$, though none of the forward
rates is a martingale under the risk-neutral measure.


3.6.2 The Discrete Drift 


Except under very special choices of $\sigma$, exact simulation of (3.91) is infeasible.
Simulation of the general HJM forward rate dynamics requires introducing a
discrete approximation. In fact, each of the two arguments of $f(t,T)$ requires
discretization. For the first argument, fix a time grid $0 = t_0 < t_1 < \cdots < t_M$.
Even at a fixed time $t_i$, it is generally not possible to represent the full
forward curve $f(t_i,T)$, $t_i \le T \le T^*$, so instead we fix a grid of maturities
and approximate the forward curve by its value for just these maturities. In
principle, the time grid and the maturity grid could be different; however, 
assuming that the two sets of dates are the same greatly simplifies notation 
with little loss of generality. 

We use hats to distinguish discretized variables from their exact
continuous-time counterparts. Thus, $\hat f(t_i,t_j)$ denotes the discretized forward rate for
maturity $t_j$ as of time $t_i$, $j \ge i$, and $\hat B(t_i,t_j)$ denotes the corresponding bond
price,

$$\hat B(t_i,t_j) = \exp\left(-\sum_{\ell=i}^{j-1} \hat f(t_i,t_\ell)\,[t_{\ell+1} - t_\ell]\right). \tag{3.94}$$

To avoid introducing any more discretization error than necessary, we
would like the initial values of the discretized bonds $\hat B(0,t_j)$ to coincide with
the exact values $B(0,t_j)$ for all maturities $t_j$ on the discrete grid. Comparing
(3.94) with the equation that precedes (3.87), we see that this holds if

$$\sum_{\ell=0}^{j-1} \hat f(0,t_\ell)\,[t_{\ell+1} - t_\ell] = \int_0^{t_j} f(0,u)\,du;$$

i.e., if

$$\hat f(0,t_\ell) = \frac{1}{t_{\ell+1} - t_\ell} \int_{t_\ell}^{t_{\ell+1}} f(0,u)\,du \tag{3.95}$$

for all $\ell = 0,1,\ldots,M-1$. This indicates that we should initialize each $\hat f(0,t_\ell)$
to the average level of the forward curve $f(0,T)$ over the interval $[t_\ell, t_{\ell+1}]$
rather than, for example, initializing it to the value $f(0,t_\ell)$ at the left endpoint
of this interval. The discretization (3.95) is illustrated in Figure 3.15.


Fig. 3.15. Discretization of initial forward curve. Each discretized forward rate is 
the average of the underlying forward curve over the discretization interval. 
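
Since the integral in (3.95) equals $\log B(0,t_\ell) - \log B(0,t_{\ell+1})$, the initialization can be computed from bond prices alone. The following Python sketch illustrates this; the function names and the sample bond prices are made up for the illustration.

```python
import math


def initial_forward_rates(times, bond_prices):
    """fhat(0, t_l) for l = 0..M-1 from B(0, t_0), ..., B(0, t_M), per (3.95)."""
    return [
        math.log(bond_prices[l] / bond_prices[l + 1]) / (times[l + 1] - times[l])
        for l in range(len(times) - 1)
    ]


def discretized_bond(fhat, times, j):
    """Bhat(0, t_j) from (3.94) with i = 0."""
    return math.exp(-sum(fhat[l] * (times[l + 1] - times[l]) for l in range(j)))


times = [0.0, 0.5, 1.0, 2.0]            # t_0, ..., t_M with M = 3
bond_prices = [1.0, 0.98, 0.95, 0.90]   # hypothetical B(0, t_j)
fhat = initial_forward_rates(times, bond_prices)
print(fhat)
```

By construction, the discretized bond prices recovered through (3.94) reproduce the input prices at every grid date.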


Once the initial curve has been specified, a generic simulation of a
single-factor model evolves like this: for $i = 1,\ldots,M$,

$$\hat f(t_i,t_j) = \hat f(t_{i-1},t_j) + \hat\mu(t_{i-1},t_j)\,[t_i - t_{i-1}] + \hat\sigma(t_{i-1},t_j)\sqrt{t_i - t_{i-1}}\;Z_i, \quad j = i,\ldots,M, \tag{3.96}$$

where $Z_1,\ldots,Z_M$ are independent $N(0,1)$ random variables and $\hat\mu$ and $\hat\sigma$
denote discrete counterparts of the continuous-time coefficients in (3.91). We
allow $\hat\sigma$ to depend on the current vector $\hat f$ as well as on time and maturity,
though to lighten notation we do not include $\hat f$ as an explicit argument of $\hat\sigma$.

In practice, $\hat\sigma$ would typically be specified through a calibration procedure
designed to make the simulated model consistent with market prices of actively
traded derivative securities. (We discuss calibration of a closely related class
of models in Section 3.7.4.) In fact, the continuous-time limit $\sigma(t,T)$ may
never be specified explicitly because only the discrete version $\hat\sigma$ is used in the
simulation. But the situation for the drift is different. Recall that in deriving
(3.91) we chose the drift to make the model arbitrage-free; more precisely,
we chose it to make the discounted bond prices martingales. There are many
ways one might consider choosing the discrete drift $\hat\mu$ in (3.96) to approximate
the continuous-time limit (3.90). From the many possible approximations, we
choose the one that preserves the martingale property for the discounted bond
prices.
Recalling that $f(s,s)$ is the short rate at time $s$, we can express the
continuous-time condition as the requirement that

$$B(t,T) \exp\left(-\int_0^t f(s,s)\,ds\right)$$

be a martingale in $t$ for each $T$. Similarly, in the discretized model we would
like

$$\hat B(t_i,t_j) \exp\left(-\sum_{k=0}^{i-1} \hat f(t_k,t_k)\,[t_{k+1} - t_k]\right)$$

to be a martingale in $i$ for each $j$. Our objective is to find a $\hat\mu$ for which this
holds. For simplicity, we start by assuming a single-factor model.
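The martingale condition can be expressed as

$$E\left[\hat B(t_i,t_j)\, e^{-\sum_{k=0}^{i-1} \hat f(t_k,t_k)[t_{k+1}-t_k]} \,\Big|\, Z_1,\ldots,Z_{i-1}\right]
= \hat B(t_{i-1},t_j)\, e^{-\sum_{k=0}^{i-2} \hat f(t_k,t_k)[t_{k+1}-t_k]}.$$

Using (3.94) and canceling terms that appear on both sides, this reduces to

$$E\left[e^{-\sum_{\ell=i}^{j-1} \hat f(t_i,t_\ell)[t_{\ell+1}-t_\ell]} \,\Big|\, Z_1,\ldots,Z_{i-1}\right]
= e^{-\sum_{\ell=i}^{j-1} \hat f(t_{i-1},t_\ell)[t_{\ell+1}-t_\ell]}.$$

Now we introduce $\hat\mu$: on the left side of this equation we substitute for each
$\hat f(t_i,t_\ell)$ according to (3.96). This yields the condition

$$E\left[e^{-\sum_{\ell=i}^{j-1} \left(\hat f(t_{i-1},t_\ell) + \hat\mu(t_{i-1},t_\ell)[t_i-t_{i-1}] + \hat\sigma(t_{i-1},t_\ell)\sqrt{t_i-t_{i-1}}\,Z_i\right)[t_{\ell+1}-t_\ell]} \,\Big|\, Z_1,\ldots,Z_{i-1}\right]
= e^{-\sum_{\ell=i}^{j-1} \hat f(t_{i-1},t_\ell)[t_{\ell+1}-t_\ell]}.$$

Canceling terms that appear on both sides and rearranging the remaining
terms brings this into the form

$$E\left[e^{-\sum_{\ell=i}^{j-1} \hat\sigma(t_{i-1},t_\ell)\sqrt{t_i-t_{i-1}}\,[t_{\ell+1}-t_\ell]\,Z_i} \,\Big|\, Z_1,\ldots,Z_{i-1}\right]
= e^{\sum_{\ell=i}^{j-1} \hat\mu(t_{i-1},t_\ell)[t_i-t_{i-1}][t_{\ell+1}-t_\ell]}.$$

The conditional expectation on the left evaluates to

$$e^{\frac{1}{2}\left(\sum_{\ell=i}^{j-1} \hat\sigma(t_{i-1},t_\ell)[t_{\ell+1}-t_\ell]\right)^2 [t_i-t_{i-1}]},$$

so equality holds if

$$\frac{1}{2}\left(\sum_{\ell=i}^{j-1} \hat\sigma(t_{i-1},t_\ell)[t_{\ell+1}-t_\ell]\right)^2 = \sum_{\ell=i}^{j-1} \hat\mu(t_{i-1},t_\ell)[t_{\ell+1}-t_\ell];$$

i.e., if

$$\hat\mu(t_{i-1},t_j)[t_{j+1}-t_j] =
\frac{1}{2}\left(\sum_{\ell=i}^{j} \hat\sigma(t_{i-1},t_\ell)[t_{\ell+1}-t_\ell]\right)^2 - \frac{1}{2}\left(\sum_{\ell=i}^{j-1} \hat\sigma(t_{i-1},t_\ell)[t_{\ell+1}-t_\ell]\right)^2. \tag{3.97}$$

This is the discrete version of the HJM drift; it ensures that the discretized
discounted bond prices are martingales.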
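
A direct transcription of (3.97) for a single factor is sketched below. The function name, the use of a dictionary keyed by the absolute maturity index $j$, and the sample volatilities are choices made for this illustration only.

```python
def discrete_drift(sigma_hat, times, i):
    """muhat(t_{i-1}, t_j) for j = i..M-1 from (3.97), single factor.

    sigma_hat[j] holds sigmahat(t_{i-1}, t_j); times = [t_0, ..., t_M].
    Returns a dict keyed by the absolute maturity index j.
    """
    M = len(times) - 1
    mu, running, prev_half_sq = {}, 0.0, 0.0
    for j in range(i, M):
        dt = times[j + 1] - times[j]
        running += sigma_hat[j] * dt      # sum over l = i..j of sigmahat * [t_{l+1} - t_l]
        half_sq = 0.5 * running * running
        mu[j] = (half_sq - prev_half_sq) / dt
        prev_half_sq = half_sq
    return mu


times = [0.0, 0.5, 1.0, 2.0, 3.0]
sigma_hat = [0.010, 0.012, 0.011, 0.009]   # hypothetical values of sigmahat(t_0, t_j)
mu = discrete_drift(sigma_hat, times, i=1)
print(mu)
```

Summing `mu[l] * (times[l + 1] - times[l])` over $\ell = i,\ldots,j-1$ telescopes to one half of the squared volatility sum, which is exactly the equality that precedes (3.97).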

To see the connection between this expression and the continuous-time
drift (3.90), consider the case of an equally spaced grid, $t_i = ih$ for some
increment $h > 0$. Fix a date $t$ and maturity $T$ and let $i, j \to \infty$ and $h \to 0$
in such a way that $jh = T$ and $ih = t$; each of the sums in (3.97) is then
approximated by an integral. Dividing both sides of (3.97) by $t_{j+1} - t_j = h$,
we find that for small $h$ the discrete drift is approximately

$$\frac{1}{2h}\left[\left(\int_t^{T+h} \sigma(t,u)\,du\right)^2 - \left(\int_t^{T} \sigma(t,u)\,du\right)^2\right] \approx \frac{1}{2}\,\frac{\partial}{\partial T}\left(\int_t^T \sigma(t,u)\,du\right)^2,$$

which is

$$\sigma(t,T) \int_t^T \sigma(t,u)\,du.$$

This suggests that the discrete drift in (3.97) is indeed consistent with the
continuous-time limit in (3.90).
In the derivation leading to (3.97) we assumed a single-factor model. A
similar result holds with $d$ factors. Let $\hat\sigma_k$ denote the $k$th entry of the $d$-vector
$\hat\sigma$ and let

$$\hat\mu_k(t_{i-1},t_j)[t_{j+1}-t_j] =
\frac{1}{2}\left(\sum_{\ell=i}^{j} \hat\sigma_k(t_{i-1},t_\ell)[t_{\ell+1}-t_\ell]\right)^2 - \frac{1}{2}\left(\sum_{\ell=i}^{j-1} \hat\sigma_k(t_{i-1},t_\ell)[t_{\ell+1}-t_\ell]\right)^2$$

for $k = 1,\ldots,d$. The combined drift is given by the sum

$$\hat\mu(t_{i-1},t_j) = \sum_{k=1}^{d} \hat\mu_k(t_{i-1},t_j).$$

A generic multifactor simulation takes the form

$$\hat f(t_i,t_j) = \hat f(t_{i-1},t_j) + \hat\mu(t_{i-1},t_j)[t_i - t_{i-1}] + \sum_{k=1}^{d} \hat\sigma_k(t_{i-1},t_j)\sqrt{t_i - t_{i-1}}\;Z_{ik}, \quad j = i,\ldots,M, \tag{3.98}$$

where the $Z_i = (Z_{i1},\ldots,Z_{id})$, $i = 1,\ldots,M$, are independent $N(0,I)$ random
vectors.
We derived (3.97) by starting from the principle that the discretized dis- 
counted bond prices should be martingales. But what are the practical impli- 
cations of using some other approximation to the continuous drift instead of 
this one? To appreciate the consequences, consider the following experiment. 
Imagine simulating paths of $\hat f$ as in (3.96) or (3.98). From a path of $\hat f$ we may
extract a path

$$\hat r(t_0) = \hat f(t_0,t_0), \quad \hat r(t_1) = \hat f(t_1,t_1), \quad \ldots, \quad \hat r(t_M) = \hat f(t_M,t_M),$$

of the discretized short rate $\hat r$. From this we can calculate a discount factor

$$\hat D(t_j) = \exp\left(-\sum_{i=0}^{j-1} \hat r(t_i)\,[t_{i+1} - t_i]\right) \tag{3.99}$$

for each maturity $t_j$. Imagine repeating this over $n$ independent paths and let
$\hat D^{(1)}(t_j),\ldots,\hat D^{(n)}(t_j)$ denote discount factors calculated over these $n$ paths.
A consequence of the strong law of large numbers, the martingale property,
and the initialization in (3.95) is that, almost surely,

$$\frac{1}{n}\sum_{i=1}^{n} \hat D^{(i)}(t_j) \to E[\hat D(t_j)] = \hat B(0,t_j) = B(0,t_j).$$


This means that if we simulate using (3.97) and then use the simulation to
price a bond, the simulation price converges to the value to which the model
was ostensibly calibrated. With some other choice of discrete drift, the
simulation price would in general converge to something that differs from $B(0,t_j)$,
even if only slightly. Thus, the martingale condition is not simply a theoretical
feature; it is a prerequisite for internal consistency of the simulated model.
Indeed, failure of this condition can create the illusion of arbitrage
opportunities. If $E[\hat D^{(1)}(t_j)] \ne B(0,t_j)$, the simulation would be telling us that the
market has mispriced the bond.

The errors (or apparent arbitrage opportunities) that may arise from using 
a different approximation to the continuous-time drift may admittedly be 
quite small. But given that we have a simple way of avoiding such errors and 
given that the form of the drift is the central feature of the HJM framework, 
we may as well restrict ourselves to (3.97). This form of the discrete drift 
appears to be in widespread use in the industry; it is explicit in Andersen
[11].


Forward Measure


Through an argument similar to the one leading to (3.97), we can find the
appropriate form of the discrete drift under the forward measure. In continuous
time, the forward measure for maturity $T_F$ is characterized by the
requirement that $B(t,T)/B(t,T_F)$ be a martingale, because the bond maturing at
$T_F$ is the numeraire asset associated with this measure. In the discrete
approximation, if we take $t_M = T_F$, then we require that $\hat B(t_i,t_j)/\hat B(t_i,t_M)$ be
a martingale in $i$ for each $j$. This ratio is given by

$$\frac{\hat B(t_i,t_j)}{\hat B(t_i,t_M)} = \exp\left(\sum_{\ell=j}^{M-1} \hat f(t_i,t_\ell)\,[t_{\ell+1} - t_\ell]\right).$$

The martingale condition leads to a discrete drift $\hat\mu$ with

$$\hat\mu(t_{i-1},t_j)[t_{j+1}-t_j] =
\frac{1}{2}\left(\sum_{\ell=j+1}^{M-1} \hat\sigma(t_{i-1},t_\ell)[t_{\ell+1}-t_\ell]\right)^2 - \frac{1}{2}\left(\sum_{\ell=j}^{M-1} \hat\sigma(t_{i-1},t_\ell)[t_{\ell+1}-t_\ell]\right)^2. \tag{3.100}$$

The relation between this and the risk-neutral discrete drift (3.97) is, not
surprisingly, similar to the relation between their continuous-time counterparts
in (3.91) and (3.93).
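
The forward-measure drift (3.100) can be computed with a backward pass over maturities, mirroring the forward pass that suits (3.97). The sketch below is an illustration for a single factor; the function name and the sample inputs are assumptions made here.

```python
def forward_measure_drift(sigma_hat, times, i):
    """muhat(t_{i-1}, t_j), j = i..M-1, under the t_M-forward measure, per (3.100).

    sigma_hat[j] holds sigmahat(t_{i-1}, t_j); times = [t_0, ..., t_M].
    """
    M = len(times) - 1
    mu, tail, next_half_sq = {}, 0.0, 0.0
    for j in range(M - 1, i - 1, -1):
        dt = times[j + 1] - times[j]
        tail += sigma_hat[j] * dt          # sum over l = j..M-1
        half_sq = 0.5 * tail * tail
        mu[j] = (next_half_sq - half_sq) / dt
        next_half_sq = half_sq             # becomes the l = j+1..M-1 term at the next step
    return mu


times = [0.0, 0.5, 1.0, 2.0, 3.0]
sigma_hat = [0.010, 0.012, 0.011, 0.009]   # hypothetical sigmahat(t_0, t_j)
mu_fwd = forward_measure_drift(sigma_hat, times, i=1)
print(mu_fwd)
```

With positive volatilities every entry is negative, in line with the sign change seen in (3.93) relative to (3.91).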


3.6.3 Implementation 


Once we have identified the discrete form of the drift, the main consideration
in implementing an HJM simulation is keeping track of indices. The notation
$\hat f(t_i,t_j)$ is convenient in describing the discretized model: the first argument
shows the current time, the second argument shows the maturity to which
this forward rate applies. But in implementing the simulation we are not
interested in keeping track of an $M \times M$ matrix of rates as the notation
$\hat f(t_i,t_j)$ might suggest. At each time step, we need only the vector of current
rates. To implement an HJM simulation we need to adopt some conventions
regarding the indexing of this vector.

Recall that our time and maturity grid consists of a set of dates
$0 = t_0 < t_1 < \cdots < t_M$. If we identify $t_M$ with the ultimate maturity $T^*$ in
the continuous-time model, then $t_M$ is the maturity of the longest-maturity
bond represented in the model. In light of (3.94), this means that the last
forward rate relevant to the model applies to the interval $[t_{M-1}, t_M]$; this is
the forward rate with maturity argument $t_{M-1}$. Thus, our initial vector of
forward rates consists of the $M$ components $\hat f(0,0), \hat f(0,t_1), \ldots, \hat f(0,t_{M-1})$,
which is consistent with the initialization (3.95). At the start of the simulation
we will represent this vector as $(f_1,\ldots,f_M)$. Thus, our first convention is to
use 1 rather than 0 as the lowest index value.

As the simulation evolves, the number of relevant rates decreases. At time
$t_i$, only the rates $\hat f(t_i,t_i), \ldots, \hat f(t_i,t_{M-1})$ are meaningful. We need to specify
how these $M - i$ rates should be indexed, given that initially we had a vector
of $M$ rates: we can either pad the initial portion of the vector with irrelevant
data or we can shorten the length of the vector. We choose the latter and
represent the $M - i$ rates remaining at $t_i$ as the vector $(f_1,\ldots,f_{M-i})$. Thus,
our second convention is to index forward rates by relative maturity rather
than absolute maturity. At time $t_i$, $f_j$ refers to the forward rate $\hat f(t_i,t_{i+j-1})$.
Under this convention $f_1$ always refers to the current level of the short rate
because $\hat r(t_i) = \hat f(t_i,t_i)$.

Similar considerations apply to $\hat\mu(t_i,t_j)$ and $\hat\sigma_k(t_i,t_j)$, $k = 1,\ldots,d$, and
we adopt similar conventions for the variables representing these terms. For
values of $\hat\mu$ we use variables $m_j$ and for values of $\hat\sigma_k$ we use variables $s_j(k)$;
in both cases the subscript indicates a relative maturity and in the case of
$s_j(k)$ the argument $k = 1,\ldots,d$ refers to the factor index in a $d$-factor model.
We design the indexing so that the simulation step from $t_{i-1}$ to $t_i$ indicated
in (3.98) becomes

$$f_j \leftarrow f_{j+1} + m_j\,[t_i - t_{i-1}] + \sum_{k=1}^{d} s_j(k)\sqrt{t_i - t_{i-1}}\;Z_{ik}, \quad j = 1,\ldots,M-i.$$

Thus, in advancing from $t_{i-1}$ to $t_i$ we want

$$m_j = \hat\mu(t_{i-1},t_{i+j-1}), \quad s_j(k) = \hat\sigma_k(t_{i-1},t_{i+j-1}). \tag{3.101}$$

In particular, recall that $\hat\sigma$ may depend on the current vector of forward rates;
as implied by (3.101), the values of all $s_j(k)$ should be determined before the
forward rates are updated.

To avoid repeated calculation of the intervals between dates $t_i$, we
introduce the notation

$$h_i = t_i - t_{i-1}, \quad i = 1,\ldots,M.$$

These values do not change in the course of a simulation so we use the vector
$(h_1,\ldots,h_M)$ to represent these same values at all steps of the simulation.

We now proceed to detail the steps in an HJM simulation. We separate
the algorithm into two parts, one calculating the discrete drift parameter at
a fixed time step, the other looping over time steps and updating the forward
curve at each step. Figure 3.16 illustrates the calculation of

$$\hat\mu(t_{i-1},t_j) = \frac{1}{2h_{j+1}}\left[\sum_{k=1}^{d}\left(\sum_{\ell=i}^{j} \hat\sigma_k(t_{i-1},t_\ell)\,h_{\ell+1}\right)^2 - \sum_{k=1}^{d}\left(\sum_{\ell=i}^{j-1} \hat\sigma_k(t_{i-1},t_\ell)\,h_{\ell+1}\right)^2\right]$$

in a way that avoids duplicate computation. In the notation of the algorithm,
this drift parameter is evaluated as

$$\frac{1}{2(t_{j+1} - t_j)}\left[B_{\text{next}} - B_{\text{prev}}\right]$$

and each $A_{\text{next}}(k)$ records a quantity of the form

$$\sum_{\ell=i}^{j} \hat\sigma_k(t_{i-1},t_\ell)\,h_{\ell+1}.$$

Inputs: s_j(k), j = 1,...,M−i, k = 1,...,d as in (3.101)
        and h_1,...,h_M (h_ℓ = t_ℓ − t_{ℓ−1})

A_prev(k) ← 0, k = 1,...,d
B_prev ← 0
for j = 1,...,M−i
    B_next ← 0
    for k = 1,...,d
        A_next(k) ← A_prev(k) + s_j(k) * h_{i+j}
        B_next ← B_next + A_next(k) * A_next(k)
        A_prev(k) ← A_next(k)
    end
    m_j ← (B_next − B_prev)/(2h_{i+j})
    B_prev ← B_next
end
return m_1,...,m_{M−i}.


Fig. 3.16. Calculation of discrete drift parameters $m_j = \hat\mu(t_{i-1},t_{i+j-1})$ needed to
simulate transition from $t_{i-1}$ to $t_i$.
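
The drift calculation in Figure 3.16 might be written in Python as follows. This is a sketch under stated assumptions: lists are 0-based, so `s[j-1][k-1]` plays the role of $s_j(k)$ and `h[l-1]` the role of $h_\ell$, and the function name is invented for this illustration.

```python
def drift_parameters(s, h, i):
    """Return [m_1, ..., m_{M-i}] as in Figure 3.16 (0-based Python lists).

    s has M - i rows of d entries each; h = [h_1, ..., h_M].
    """
    M, d = len(h), len(s[0])
    a_prev = [0.0] * d      # running sums A(k), one per factor
    b_prev = 0.0            # sum over factors of A(k)^2 at the previous maturity
    m = []
    for j in range(1, M - i + 1):
        step = h[i + j - 1]                 # h_{i+j}
        b_next = 0.0
        for k in range(d):
            a_next = a_prev[k] + s[j - 1][k] * step
            b_next += a_next * a_next
            a_prev[k] = a_next
        m.append((b_next - b_prev) / (2.0 * step))
        b_prev = b_next
    return m


h = [0.5, 0.5, 0.5, 0.5]
m = drift_parameters([[0.01], [0.01], [0.01]], h, i=1)
print(m)
```

For a single factor with constant volatility $\sigma$ and equal spacing $h$, the result is $m_j = \sigma^2 h\,(j - \tfrac{1}{2})$, a discrete analog of the drift $\sigma^2(T - t)$ in Example 3.6.1.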


Figure 3.17 shows an algorithm for a single replication in an HJM simu- 
lation; the steps in the figure would naturally be repeated over many inde- 
pendent replications. This algorithm calls the one in Figure 3.16 to calculate 
the discrete drift for all remaining maturities at each time step. The two algo- 
rithms could obviously be combined, but keeping them separate should help 
clarify the various steps. In addition, it helps stress the point that in
propagating the forward curve from $t_{i-1}$ to $t_i$, we first evaluate the $s_j(k)$ and $m_j$
using the forward rates at step $i - 1$ and then update the rates to get their
values at step $i$.

To make this point a bit more concrete, suppose we specified a single-factor
model with $\hat\sigma(t_i,t_j) = \tilde\sigma(i,j)\,\hat f(t_i,t_j)$ for some fixed values $\tilde\sigma(i,j)$. This makes
each $\hat\sigma(t_i,t_j)$ proportional to the corresponding forward rate. We noted in
Example 3.6.3 that this type of diffusion term is inadmissible in the
continuous-time limit, but it can be (and often is) used in practice so long as the
increments $h_i$ are kept bounded away from zero. In this model it should be clear
that in updating $\hat f(t_{i-1},t_j)$ to $\hat f(t_i,t_j)$ we need to evaluate $\tilde\sigma(i-1,j)\,\hat f(t_{i-1},t_j)$
before we update the forward rate.

Inputs: initial curve (f_1,...,f_M) and intervals (h_1,...,h_M)

D ← 1, P ← 0, C ← 0.
for i = 1,...,M−1
    D ← D * exp(−f_1 * h_i)
    evaluate s_j(k), j = 1,...,M−i, k = 1,...,d
        (recall that s_j(k) = σ̂_k(t_{i−1}, t_{i+j−1}))
    evaluate m_1,...,m_{M−i} using Figure 3.16
    generate Z_1,...,Z_d ~ N(0,1)
    for j = 1,...,M−i
        S ← 0
        for k = 1,...,d  S ← S + s_j(k) * Z_k; end
        f_j ← f_{j+1} + m_j * h_i + S * sqrt(h_i)
    end
    P ← cashflow at t_i (depending on instrument)
    C ← C + D * P
end
return C.


Fig. 3.17. Algorithm to simulate evolution of forward curve over $t_0, t_1, \dots, t_{M-1}$
and calculate cumulative discounted cashflows from an interest rate derivative.


Since an HJM simulation is typically used to value interest rate derivatives,
we have included in Figure 3.17 a few additional generic steps illustrating how
a path of the forward curve is used both to compute and to discount the
payoff of a derivative. The details of a particular instrument are subsumed in
the placeholder “cashflow at $t_i$.” This cashflow is discounted through
multiplication by $D$, which is easily seen to contain the simulated value of the
discount factor $\hat D(t_i)$ as defined in (3.99). (When $D$ is updated in Figure 3.17,
before the forward rates are updated, $f_1$ records the short rate for the interval
$[t_{i-1},t_i]$.) To make the pricing application more explicit, we consider a few
examples.
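
Before turning to the examples, the single replication of Figure 3.17 can be sketched in Python. This is an illustration under stated assumptions, not a definitive implementation: the callbacks `vol(i, j, f)` and `cashflow(i, f)` are hypothetical hooks introduced here for the volatility specification and the instrument, and `drifts` restates the calculation of Figure 3.16 so that the block is self-contained.

```python
import math
import random


def drifts(s, h, i):
    """Discrete drifts m_1..m_{M-i} as in Figure 3.16 (s[j-1][k] = s_j(k))."""
    d = len(s[0]) if s else 0
    a, b_prev, m = [0.0] * d, 0.0, []
    for j in range(1, len(s) + 1):
        h_ij = h[i + j - 1]                       # h_{i+j}, 1-based
        for k in range(d):
            a[k] += s[j - 1][k] * h_ij
        b_next = sum(x * x for x in a)
        m.append((b_next - b_prev) / (2.0 * h_ij))
        b_prev = b_next
    return m


def hjm_replication(f0, h, vol, cashflow, rng):
    """One path of Figure 3.17; returns the cumulative discounted cashflow C.

    f0: initial curve (f_1..f_M); h: intervals (h_1..h_M).
    vol(i, j, f): hypothetical callback returning [s_j(1), ..., s_j(d)],
        evaluated from the curve f at step i-1, i.e. before the update.
    cashflow(i, f): hypothetical callback returning the payment P at t_i.
    """
    f = list(f0)
    M = len(f)
    D, C = 1.0, 0.0
    for i in range(1, M):
        D *= math.exp(-f[0] * h[i - 1])           # f_1 is the short rate on [t_{i-1}, t_i]
        s = [vol(i, j, f) for j in range(1, M - i + 1)]
        m = drifts(s, h, i)
        z = [rng.gauss(0.0, 1.0) for _ in range(len(s[0]))]
        root_h = math.sqrt(h[i - 1])
        # f_j <- f_{j+1} + m_j * h_i + S * sqrt(h_i), all from the old curve
        f = [f[j] + m[j - 1] * h[i - 1]
             + sum(sk * zk for sk, zk in zip(s[j - 1], z)) * root_h
             for j in range(1, M - i + 1)]
        C += D * cashflow(i, f)
    return C


# With zero volatility the path is deterministic and a zero-coupon bond
# paying 1 at t_3 is discounted along the initial forward curve.
f0 = [0.03, 0.035, 0.04, 0.045, 0.05]
price = hjm_replication(f0, [0.5] * 5, lambda i, j, f: [0.0],
                        lambda i, f: 1.0 if i == 3 else 0.0, random.Random(1))
print(price)
```

Note that the volatilities and drifts are evaluated from the curve at step $i-1$ before the list `f` is rebuilt, which is the ordering stressed in the discussion of Figure 3.17. In practice the function would be called over many independent replications and the results averaged.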


Example 3.6.4 Bonds. There is no reason to use an HJM simulation to price
bonds: if properly implemented, the simulation will simply return prices
that could have been computed from the initial forward curve. Nevertheless,
we consider this example to help fix ideas. We discussed the pricing of a
zero-coupon bond following (3.99); in Figure 3.17 this corresponds to setting $P \leftarrow 1$
at the maturity of the bond and $P \leftarrow 0$ at all other dates. For a coupon paying
bond with a face value of 100 and a coupon of $c$, we would set $P \leftarrow c$ at the
coupon dates and $P \leftarrow 100 + c$ at maturity. This assumes, of course, that the
coupon dates are among the $t_1,\dots,t_M$. □


Example 3.6.5 Caps. A caplet is an interest rate derivative providing
protection against an increase in an interest rate for a single period; a cap is
a portfolio of caplets covering multiple periods. A caplet functions almost
like a call option on the short rate, which would have a payoff of the form
$(r(T) - K)^+$ for some strike $K$ and maturity $T$. In practice, a caplet differs
from this in some small but important ways. (For further background, see
Appendix C.)

In contrast to the instantaneous short rate $r(t)$, the underlying rate in a
caplet typically applies over an interval and is based on discrete compounding.
For simplicity, suppose the interval is of the form $[t_i, t_{i+1}]$. At $t_i$, the
continuously compounded rate for this interval is $\hat f(t_i,t_i)$; the corresponding
discretely compounded rate $\hat F$ satisfies

$$
\frac{1}{1+\hat F(t_i)[t_{i+1}-t_i]} = e^{-\hat f(t_i,t_i)[t_{i+1}-t_i]},
$$

i.e.,

$$
\hat F(t_i) = \frac{1}{t_{i+1}-t_i}\left(e^{\hat f(t_i,t_i)[t_{i+1}-t_i]} - 1\right).
$$

The payoff of the caplet would then be $(\hat F(t_i) - K)^+$ (or a constant multiple
of this). Moreover, this payment is ordinarily made at the end of the interval,
$t_{i+1}$. To discount it properly we should therefore simulate to $t_i$ and set

$$
P \leftarrow \frac{1}{1+\hat F(t_i)[t_{i+1}-t_i]}\,(\hat F(t_i) - K)^+; \tag{3.102}
$$

in the notation of Figure 3.17, this is

$$
P \leftarrow e^{-f_1 h_{i+1}}\left(\frac{1}{h_{i+1}}\left(e^{f_1 h_{i+1}} - 1\right) - K\right)^+.
$$

Similar ideas apply if the caplet covers an interval longer than a single
simulation interval. Suppose the caplet applies to an interval $[t_i, t_{i+n}]$. Then
(3.102) still applies at $t_i$, but with $t_{i+1}$ replaced by $t_{i+n}$ and $\hat F(t_i)$ redefined
to be

$$
\hat F(t_i) = \frac{1}{t_{i+n}-t_i}\left(\exp\left(\sum_{\ell=0}^{n-1}\hat f(t_i,t_{i+\ell})[t_{i+\ell+1}-t_{i+\ell}]\right) - 1\right).
$$

In the case of a cap consisting of caplets for, say, the periods $[t_{i_1}, t_{i_2}]$, $[t_{i_2}, t_{i_3}]$,
..., $[t_{i_k}, t_{i_{k+1}}]$, for some $i_1 < i_2 < \cdots < i_{k+1}$, this calculation would be
repeated and a cashflow recorded at each $t_{i_j}$, $j = 1,\dots,k$. □
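
As a minimal sketch of the caplet cashflow in (3.102), assuming the forward rates at $t_i$ are available as plain Python floats, one might write the following. The function names and argument layout are illustrative choices, not part of the text.

```python
import math


def caplet_cashflow(f1, h_next, strike):
    """Cashflow P recorded at t_i for a caplet on [t_i, t_{i+1}], as in (3.102).

    f1 is the short rate f(t_i, t_i); h_next is h_{i+1} = t_{i+1} - t_i.
    """
    F = (math.exp(f1 * h_next) - 1.0) / h_next      # discretely compounded rate
    return math.exp(-f1 * h_next) * max(F - strike, 0.0)


def multi_period_caplet_cashflow(f, h_ahead, strike):
    """Same for a caplet on [t_i, t_{i+n}].

    f[l] = f(t_i, t_{i+l}) and h_ahead[l] = t_{i+l+1} - t_{i+l}, l = 0..n-1.
    """
    accrual = sum(h_ahead)
    growth = math.exp(sum(fl * hl for fl, hl in zip(f, h_ahead)))
    F = (growth - 1.0) / accrual
    # 1 + F * accrual equals growth, so this discounts the payoff back to t_i
    return max(F - strike, 0.0) / (1.0 + F * accrual)


print(caplet_cashflow(0.05, 0.5, 0.04))
print(multi_period_caplet_cashflow([0.05, 0.055], [0.5, 0.5], 0.04))
```

For a cap, the same calculation would be repeated at the start date of each caplet and the result recorded as the cashflow at that date.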


Example 3.6.6 Swaptions. Consider, next, an option to swap fixed-rate
payments for floating-rate payments. (See Appendix C for background on swaps
and swaptions.) Suppose the underlying swap begins at $t_{j_0}$ with payments to
be exchanged at dates $t_{j_1},\dots,t_{j_n}$. If we denote the fixed rate in the swap by
$R$, then the fixed-rate payment at $t_{j_k}$ is $100R[t_{j_k} - t_{j_{k-1}}]$, assuming a principal
or notional amount of 100. As explained in Section C.2 of Appendix C, the
value of the swap at $t_{j_0}$ is

$$
\hat V(t_{j_0}) = 100\left(R\sum_{\ell=1}^{n}\hat B(t_{j_0},t_{j_\ell})[t_{j_\ell} - t_{j_{\ell-1}}] + \hat B(t_{j_0},t_{j_n}) - 1\right).
$$

The bond prices $\hat B(t_{j_0},t_{j_\ell})$ can be computed from the forward rates at $t_{j_0}$
using (3.94).

The holder of an option to enter this swap will exercise the option if
$\hat V(t_{j_0}) > 0$ and let it expire otherwise. (For simplicity, we are assuming
that the option expires at $t_{j_0}$ though similar calculations apply for an
option to enter into a forward swap, in which case the option expiration date
would be prior to $t_{j_0}$.) Thus, we may view the swaption as having a payoff of
$\max\{0,\hat V(t_{j_0})\}$ at $t_{j_0}$. In a simulation, we would therefore simulate the
forward curve to the option expiration date $t_{j_0}$; at that date, calculate the prices
of the bonds $\hat B(t_{j_0},t_{j_\ell})$ maturing at the payment dates of the swaps; from
the bond prices calculate the value of the swap $\hat V(t_{j_0})$ and thus the
swaption payoff $\max\{0,\hat V(t_{j_0})\}$; record this as the cashflow $P$ in the algorithm of
Figure 3.17 and discount it as in the algorithm.
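
The payoff calculation just described can be sketched as follows. Equation (3.94) is not reproduced in this passage; the sketch assumes it has the form $\hat B(t_i,t_j) = \exp(-\sum_{\ell=i}^{j-1}\hat f(t_i,t_\ell)[t_{\ell+1}-t_\ell])$. The function names and the use of grid offsets for the payment dates are illustrative assumptions.

```python
import math


def bond_prices(f, h):
    """Bond prices for maturities 0, 1, ..., len(f) grid steps ahead.

    Assumed form of (3.94): exp(-sum of forward rate times interval length).
    """
    prices, acc = [1.0], 0.0
    for fl, hl in zip(f, h):
        acc += fl * hl
        prices.append(math.exp(-acc))
    return prices


def swaption_payoff(f, h, pay_idx, R, notional=100.0):
    """max{0, V} with V the swap value at the swap start date t_{j_0}.

    f, h: forward rates and interval lengths from t_{j_0} onward.
    pay_idx: increasing grid offsets of the payment dates relative to t_{j_0}.
    """
    B = bond_prices(f, h)
    t = [0.0]
    for hl in h:
        t.append(t[-1] + hl)
    value, prev = 0.0, 0
    for idx in pay_idx:
        value += R * B[idx] * (t[idx] - t[prev])    # fixed-rate leg
        prev = idx
    value += B[pay_idx[-1]] - 1.0
    return max(0.0, notional * value)


# Flat 5% curve, semiannual payments over two years, fixed rate 6%
print(swaption_payoff([0.05] * 4, [0.5] * 4, [1, 2, 3, 4], 0.06))
```

Within the algorithm of Figure 3.17, the returned value would be recorded as the cashflow $P$ at the expiration date and zero would be recorded at every other date.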

This example illustrates a general feature of the HJM framework that 
contrasts with models based on the short rate as in Sections 3.3 and 3.4. 
Consider valuing a 5-year option on a 20-year swap. This instrument involves 
maturities as long as 25 years, so valuing it in a model of the short rate could 
involve simulating paths over a 25-year horizon. In the HJM framework, if the 
initial forward curve extends for 25 years, then we need to simulate only for 
5 years; at the expiration of the option, the remaining forward rates contain 
all the information necessary to value the underlying swap. Thus, although 
the HJM setting involves updating many more variables at each time step, it 
may also require far fewer time steps. □


3.7 Forward Rate Models: Simple Rates 


The models considered in this section are closely related to the HJM frame- 
work of the previous section in that they describe the arbitrage-free dynamics 
of the term structure of interest rates through the evolution of forward rates. 
But the models we turn to now are based on simple rather than continu- 
ously compounded forward rates. This seemingly minor shift in focus has 
surprisingly far-reaching practical and theoretical implications. This model- 
ing approach has developed primarily through the work of Miltersen, Sand- 
mann, and Sondermann [268], Brace, Gatarek, and Musiela [56], Musiela and 
Rutkowski [274], and Jamshidian [197]; it has gained rapid acceptance in the 
financial industry and stimulated a growing stream of research into what are 
often called LIBOR market models.


3.7.1 LIBOR Market Model Dynamics


The basic object of study in the HJM framework is the forward rate curve
$\{f(t,T),\ t \le T \le T^*\}$. But the instantaneous, continuously compounded
forward rates $f(t,T)$ might well be considered mathematical idealizations:
they are not directly observable in the marketplace. Most market interest
rates are based on simple compounding over intervals of, e.g., three months 
or six months. Even the instantaneous short rate r(t) treated in the models 
of Sections 3.3 and 3.4 is a bit of a mathematical fiction because short-term 
rates used for pricing are typically based on periods of one to three months. 
The term “market model” is often used to describe an approach to interest 
rate modeling based on observable market rates, and this entails a departure 
from instantaneous rates. 

Among the most important benchmark interest rates are the London Inter- 
Bank Offered Rates or LIBOR. LIBOR is calculated daily through an average 
of rates offered by banks in London. Separate rates are quoted for different 
maturities (e.g., three months and six months) and different currencies. Thus, 
each day new values are calculated for three-month Yen LIBOR, six-month 
US dollar LIBOR, and so on. 

LIBOR rates are based on simple interest. If $L$ denotes the rate for an
accrual period of length $\delta$ (think of $\delta$ as 1/4 or 1/2 for three months and six
months respectively, with time measured in years), then the interest earned
on one unit of currency over the accrual period is $\delta L$. For example, if
three-month LIBOR is 6%, the interest earned at the end of three months on a
principal of 100 is $0.25 \cdot 0.06 \cdot 100 = 1.50$.

A forward LIBOR rate works similarly. Fix $\delta$ and consider a maturity $T$.
The forward rate $L(0,T)$ is the rate set at time 0 for the interval $[T, T+\delta]$.
If we enter into a contract at time 0 to borrow 1 at time $T$ and repay it
with interest at time $T+\delta$, the interest due will be $\delta L(0,T)$. As shown in
Appendix C (specifically equation (C.5)), a simple replication argument leads
to the following identity between forward LIBOR rates and bond prices:

$$
L(0,T) = \frac{B(0,T) - B(0,T+\delta)}{\delta B(0,T+\delta)}. \tag{3.103}
$$

This further implies the relation

$$
L(0,T) = \frac{1}{\delta}\left(\exp\left(\int_T^{T+\delta} f(0,u)\,du\right) - 1\right) \tag{3.104}
$$

between continuous and simple forwards, though it is not necessary to
introduce the continuous rates to build a model based on simple rates.

It should be noted that, as is customary in this literature, we treat the 
forward LIBOR rates as though they were risk-free rates. LIBOR rates are 
based on quotes by banks which could potentially default and this risk is
presumably reflected in the rates. US Treasury bonds, in contrast, are generally
considered to have a negligible chance of default. The argument leading to
(3.103) may not hold exactly if the bonds on one side and the forward rate 
on the other reflect different levels of creditworthiness. We will not, however, 
attempt to take account of these considerations. 

Although (3.103) and (3.104) apply in principle to a continuum of
maturities $T$, we consider a class of models in which a finite set of maturities or
tenor dates

$$
0 = T_0 < T_1 < \cdots < T_M < T_{M+1}
$$

are fixed in advance. As argued in Jamshidian [197], many derivative securities
tied to LIBOR and swap rates are sensitive only to a finite set of maturities
and it should not be necessary to introduce a continuum to price and hedge
these securities. Let

$$
\delta_i = T_{i+1} - T_i, \quad i = 0,\dots,M,
$$

denote the lengths of the intervals between tenor dates. Often, these would
all be equal to a nominally fixed interval of a quarter or half year; but even
in this case, day-count conventions would produce slightly different values for
the fractions $\delta_i$.

For each date $T_n$ we let $B_n(t)$ denote the time-$t$ price of a bond maturing
at $T_n$, $0 \le t \le T_n$. In our usual notation this would be $B(t,T_n)$, but writing
$B_n(t)$ and restricting $n$ to $\{1,2,\dots,M+1\}$ emphasizes that we are working
with a finite set of bonds. Similarly, we write $L_n(t)$ for the forward rate as of
time $t$ for the accrual period $[T_n,T_{n+1}]$; see Figure 3.18. This is given in terms
of the bond prices by

$$
L_n(t) = \frac{B_n(t) - B_{n+1}(t)}{\delta_n B_{n+1}(t)}, \quad 0 \le t \le T_n,\quad n = 0,1,\dots,M. \tag{3.105}
$$

After $T_n$, the forward rate $L_n$ becomes meaningless; it sometimes simplifies
notation to extend the definition of $L_n(t)$ beyond $T_n$ by setting $L_n(t) =
L_n(T_n)$ for all $t \ge T_n$.

From (3.105) we know that bond prices determine the forward rates. At a
tenor date $T_i$, the relation can be inverted to produce

$$
B_n(T_i) = \prod_{j=i}^{n-1}\frac{1}{1+\delta_j L_j(T_i)}, \quad n = i+1,\dots,M+1. \tag{3.106}
$$
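
The two directions (3.105) and (3.106) can be illustrated with a short Python sketch. The function names and the list indexing (position $n$ of each list refers to tenor date $T_n$) are assumptions made for the example.

```python
def libor_from_bonds(B, delta):
    """Forward LIBOR rates from bond prices, as in (3.105).

    B[n] is the price of the bond maturing at T_n (B[0] = 1 at time T_0 = 0);
    delta[n] = T_{n+1} - T_n.
    """
    return [(B[n] - B[n + 1]) / (delta[n] * B[n + 1]) for n in range(len(B) - 1)]


def bonds_from_libor(L, delta, i=0):
    """Bond prices at tenor date T_i from the rates L_j(T_i), as in (3.106).

    Returns [B_i(T_i), B_{i+1}(T_i), ...], starting from B_i(T_i) = 1.
    """
    prices = [1.0]
    for j in range(i, len(L)):
        prices.append(prices[-1] / (1.0 + delta[j] * L[j]))
    return prices


delta = [0.25, 0.25, 0.5]
B = [1.0, 0.99, 0.975, 0.95]
L = libor_from_bonds(B, delta)
print(L)
print(bonds_from_libor(L, delta))   # recovers B up to rounding
```

The round trip works only at a tenor date; between tenor dates an additional short-dated discount factor is needed.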


However, at an arbitrary date $t$, the forward LIBOR rates do not determine
the bond prices because they do not determine the discount factor for intervals
shorter than the accrual periods. Suppose for example that $T_i < t < T_{i+1}$ and
we want to find the price $B_n(t)$ for some $n > i+1$. The factor

$$
\prod_{j=i+1}^{n-1}\frac{1}{1+\delta_j L_j(t)}
$$

discounts the bond’s payment at $T_n$ back to time $T_{i+1}$, but the LIBOR rates
do not specify the discount factor from $T_{i+1}$ to $t$.

Fig. 3.18. Evolution of vector of forward rates. Each $L_n(t)$ is the forward rate for
the interval $[T_n, T_{n+1}]$ as of time $t \le T_n$.

Define a function $\eta : [0,T_{M+1}) \to \{1,\dots,M+1\}$ by taking $\eta(t)$ to be the
unique integer satisfying

$$
T_{\eta(t)-1} \le t < T_{\eta(t)};
$$

thus, $\eta(t)$ gives the index of the next tenor date at time $t$. With this notation,
we have

$$
B_n(t) = B_{\eta(t)}(t)\prod_{j=\eta(t)}^{n-1}\frac{1}{1+\delta_j L_j(t)}, \quad 0 \le t < T_n; \tag{3.107}
$$

the factor $B_{\eta(t)}(t)$ (the current price of the shortest maturity bond) is the
missing piece required to express the bond prices in terms of the forward
LIBOR rates.
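
A small Python sketch of the index function $\eta$ and of (3.107) follows. It assumes the tenor dates are stored in an increasing list `tenor = [T_0, ..., T_{M+1}]`; the function names and the argument `b_eta` (the separately supplied price $B_{\eta(t)}(t)$) are illustrative.

```python
from bisect import bisect_right


def eta(t, tenor):
    """Index of the next tenor date: T_{eta-1} <= t < T_eta."""
    if not tenor[0] <= t < tenor[-1]:
        raise ValueError("t must lie in [T_0, T_{M+1})")
    # bisect_right returns the first index whose tenor date exceeds t,
    # which makes eta right-continuous: eta(T_k) = k + 1
    return bisect_right(tenor, t)


def bond_price(n, t, L, delta, tenor, b_eta):
    """B_n(t) as in (3.107); b_eta is B_{eta(t)}(t), which the LIBOR rates
    alone do not determine."""
    price = b_eta
    for j in range(eta(t, tenor), n):
        price /= 1.0 + delta[j] * L[j]
    return price


tenor = [0.0, 0.5, 1.0, 1.5]
print(eta(0.25, tenor), eta(0.5, tenor))
print(bond_price(3, 0.25, [0.0, 0.04, 0.04], [0.5] * 3, tenor, 0.99))
```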


Spot Measure 


We seek a model in which the evolution of the forward LIBOR rates is
described by a system of SDEs of the form

$$
\frac{dL_n(t)}{L_n(t)} = \mu_n(t)\,dt + \sigma_n(t)^\top dW(t), \quad 0 \le t \le T_n,\quad n = 1,\dots,M, \tag{3.108}
$$

with $W$ a $d$-dimensional standard Brownian motion. The coefficients $\mu_n$ and
$\sigma_n$ may depend on the current vector of rates $(L_1(t),\dots,L_M(t))$ as well as the
current time $t$. Notice that in (3.108) $\sigma_n$ is the (proportional) volatility because
we have divided by $L_n$ on the left, whereas in the HJM setting (3.91) we took
$\sigma(t,T)$ to be the absolute level of volatility. At this point, the distinction is
purely one of notation rather than scope because we allow $\sigma_n(t)$ to depend
on the current level of rates.

Recall that in the HJM setting we derived the form of the drift of the
forward rates from the absence of arbitrage. More specifically, we derived the
drift from the condition that bond prices be martingales when divided by
the numeraire asset. The numeraire we used is the usual one associated with
the risk-neutral measure, $\beta(t) = \exp\left(\int_0^t r(u)\,du\right)$. But introducing a short-rate
process $r(t)$ would undermine our objective of developing a model based on
the simple (and thus more realistic) rates $L_n(t)$. We therefore avoid the usual
risk-neutral measure and instead use a numeraire asset better suited to the
tenor dates $T_i$.

A simply compounded counterpart of $\beta(t)$ works as follows. Start with 1
unit of account at time 0 and buy $1/B_1(0)$ bonds maturing at $T_1$. At time $T_1$,
reinvest the funds in bonds maturing at time $T_2$ and proceed this way, at each
$T_i$ putting all funds in bonds maturing at time $T_{i+1}$. This trading strategy
earns (simple) interest at rate $L_i(T_i)$ over each interval $[T_i,T_{i+1}]$, just as in
the continuously compounded case a savings account earns interest at rate
$r(t)$ at time $t$. The initial investment of 1 at time 0 grows to a value of

$$
B^*(t) = B_{\eta(t)}(t)\prod_{j=0}^{\eta(t)-1}\left[1+\delta_j L_j(T_j)\right]
$$

at time $t$. Following Jamshidian [197], we take this as numeraire asset and call
the associated measure the spot measure.

Suppose, then, that (3.108) holds under the spot measure, meaning that $W$
is a standard Brownian motion under that measure. The absence of arbitrage
restricts the dynamics of the forward LIBOR rates through the condition that
bond prices be martingales when deflated by the numeraire asset. (We use the
term “deflated” rather than “discounted” to emphasize that we are dividing
by the numeraire asset and not discounting at a continuously compounded
rate.) From (3.107) and the expression for $B^*$, we find that the deflated bond
price $D_n(t) = B_n(t)/B^*(t)$ is given by

$$
D_n(t) = \left(\prod_{j=0}^{\eta(t)-1}\frac{1}{1+\delta_j L_j(T_j)}\right)\prod_{j=\eta(t)}^{n-1}\frac{1}{1+\delta_j L_j(t)}, \quad 0 \le t \le T_n. \tag{3.109}
$$

Notice that the spot measure numeraire $B^*$ cancels the factor $B_{\eta(t)}(t)$ used in
(3.107) to discount between tenor dates. We are thus left in (3.109) with an
expression defined purely in terms of the LIBOR rates. This would not have
been the case had we divided by the risk-neutral numeraire asset $\beta(t)$.

We require that the deflated bond prices $D_n$ be positive martingales
and proceed to derive the restrictions this imposes on the LIBOR
dynamics (3.108). If the deflated bonds are indeed positive martingales, we may
write

$$
\frac{dD_{n+1}(t)}{D_{n+1}(t)} = \nu_{n+1}(t)^\top dW(t), \quad n = 1,\dots,M,
$$

for some $\mathbb{R}^d$-valued processes $\nu_{n+1}$ which may depend on the current level of
$(D_2,\dots,D_{M+1})$ (equivalently, of $(L_1,\dots,L_M)$). By Itô’s formula,

$$
d\log D_{n+1}(t) = -\tfrac{1}{2}\|\nu_{n+1}(t)\|^2\,dt + \nu_{n+1}(t)^\top dW(t).
$$

We may therefore express $\nu_{n+1}$ by finding the coefficient of $dW$ in

$$
d\log D_{n+1}(t) = -\sum_{j=\eta(t)}^{n} d\log(1+\delta_j L_j(t));
$$

notice that the first factor in (3.109) is constant between maturities $T_i$.
Applying Itô’s formula and (3.108), we find that

$$
\nu_{n+1}(t) = -\sum_{j=\eta(t)}^{n}\frac{\delta_j L_j(t)}{1+\delta_j L_j(t)}\,\sigma_j(t). \tag{3.110}
$$


We now proceed by induction to find the $\mu_n$ in (3.108). Setting $D_1(t) \equiv
B_1(0)$, we make $D_1$ constant and hence a martingale without restrictions on
any of the LIBOR rates. Suppose now that $\mu_1,\dots,\mu_{n-1}$ have been chosen
consistent with the martingale condition on $D_n$. From the identity $D_n(t) =
D_{n+1}(t)(1+\delta_n L_n(t))$, we find that $\delta_n L_n(t)D_{n+1}(t) = D_n(t) - D_{n+1}(t)$, so $D_{n+1}$
is a martingale if and only if $L_n D_{n+1}$ is a martingale. Applying Itô’s formula,
we get

$$
\begin{aligned}
d(L_n D_{n+1}) &= D_{n+1}\,dL_n + L_n\,dD_{n+1} + L_n D_{n+1}\nu_{n+1}^\top\sigma_n\,dt \\
&= \left(D_{n+1}\mu_n L_n + L_n D_{n+1}\nu_{n+1}^\top\sigma_n\right)dt + L_n D_{n+1}\sigma_n^\top dW + L_n\,dD_{n+1}.
\end{aligned}
$$

(We have suppressed the time argument to lighten the notation.) To be
consistent with the martingale restriction on $D_{n+1}$ and $L_n D_{n+1}$, the $dt$ coefficient
must be zero, and thus

$$
\mu_n = -\sigma_n^\top\nu_{n+1};
$$

notice the similarity to the HJM drift (3.90). Combining this with (3.110), we
arrive at

$$
\mu_n(t) = \sum_{j=\eta(t)}^{n}\frac{\delta_j L_j(t)\,\sigma_n(t)^\top\sigma_j(t)}{1+\delta_j L_j(t)} \tag{3.111}
$$

as the required drift parameter in (3.108), so

$$
\frac{dL_n(t)}{L_n(t)} = \sum_{j=\eta(t)}^{n}\frac{\delta_j L_j(t)\,\sigma_n(t)^\top\sigma_j(t)}{1+\delta_j L_j(t)}\,dt + \sigma_n(t)^\top dW(t), \quad 0 \le t \le T_n, \tag{3.112}
$$

$n = 1,\dots,M$, describes the arbitrage-free dynamics of forward LIBOR rates
under the spot measure. This formulation is from Jamshidian [197], which
should be consulted for a rigorous and more general development.
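
As an illustration of (3.111), the spot-measure drift of a single rate can be evaluated directly from the current rates and volatility vectors. The sketch below assumes lists indexed by $n = 0,\dots,M$ (so that `L[n]` is $L_n(t)$ and `sigma[n]` is the vector $\sigma_n(t)$); the function name is hypothetical.

```python
def spot_drift(n, eta_t, L, delta, sigma):
    """Drift mu_n(t) of (3.111) under the spot measure.

    eta_t is eta(t), the index of the next tenor date; sigma[j] is the
    d-dimensional volatility vector of L_j at time t.
    """
    total = 0.0
    for j in range(eta_t, n + 1):
        dot = sum(a * b for a, b in zip(sigma[n], sigma[j]))   # sigma_n' sigma_j
        total += delta[j] * L[j] * dot / (1.0 + delta[j] * L[j])
    return total


L = [0.05, 0.05, 0.05, 0.05]
delta = [0.5] * 4
sigma = [[0.2]] * 4
print(spot_drift(1, 1, L, delta, sigma))   # one term in the sum
print(spot_drift(3, 1, L, delta, sigma))   # three terms
```

Each term has the factor $\delta_j L_j/(1+\delta_j L_j)$, which lies between 0 and 1 for nonnegative rates.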


Forward Measure 


As in Musiela and Rutkowski [274], we may alternatively formulate a LIBOR
market model under the forward measure $P_{M+1}$ for maturity $T_{M+1}$ and take
the bond $B_{M+1}$ as numeraire asset. In this case, we redefine the deflated bond
prices to be the ratios $D_n(t) = B_n(t)/B_{M+1}(t)$, which simplify to

$$
D_n(t) = \prod_{j=n+1}^{M}\left(1+\delta_j L_j(t)\right). \tag{3.113}
$$

Notice that the numeraire asset has once again canceled the factor $B_{\eta(t)}(t)$,
leaving an expression that depends solely on the forward LIBOR rates.

We could derive the dynamics of the forward LIBOR rates under the
forward measure through the Girsanov Theorem and (3.112), much as we did
in the HJM setting to arrive at (3.93). Alternatively, we could start from the
requirement that the $D_n$ in (3.113) be martingales and proceed by induction
(backwards from $n = M$) to derive restrictions on the evolution of the $D_n$.
Either way, we find that the arbitrage-free dynamics of the $L_n$, $n = 1,\dots,M$,
under the forward measure $P_{M+1}$ are given by

$$
\frac{dL_n(t)}{L_n(t)} = -\sum_{j=n+1}^{M}\frac{\delta_j L_j(t)\,\sigma_n(t)^\top\sigma_j(t)}{1+\delta_j L_j(t)}\,dt + \sigma_n(t)^\top dW^{M+1}(t), \quad 0 \le t \le T_n, \tag{3.114}
$$

with $W^{M+1}$ a standard $d$-dimensional Brownian motion under $P_{M+1}$. The
relation between the drift in (3.114) and the drift in (3.112) is analogous to
the relation between the risk-neutral and forward-measure drifts in the HJM
setting; compare (3.90) and (3.93).

If we take $n = M$ in (3.114), we find that

$$
\frac{dL_M(t)}{L_M(t)} = \sigma_M(t)^\top dW^{M+1}(t),
$$

so that, subject only to regularity conditions on its volatility, $L_M$ is a
martingale under the forward measure for maturity $T_{M+1}$. Moreover, if $\sigma_M$ is
deterministic then $L_M(t)$ has lognormal distribution $LN(-\bar\sigma_M^2(t)/2,\ \bar\sigma_M^2(t))$
with

$$
\bar\sigma_M(t) = \sqrt{\frac{1}{t}\int_0^t \|\sigma_M(u)\|^2\,du}. \tag{3.115}
$$


In fact, the choice of $M$ is arbitrary: each $L_n$ is a martingale (lognormal if $\sigma_n$
is deterministic) under the forward measure $P_{n+1}$ associated with $T_{n+1}$.

These observations raise the question of whether we may in fact take the
coefficients $\sigma_n$ to be deterministic in (3.112) and (3.114). Recall from
Example 3.6.3 that this choice (deterministic proportional volatility) is inadmissible
in the HJM setting, essentially because it makes the HJM drift quadratic in
the current level of rates. To see what happens with simple compounding,
rewrite (3.112) as

$$
dL_n(t) = \sum_{j=\eta(t)}^{n}\frac{\delta_j L_j(t)L_n(t)\,\sigma_n(t)^\top\sigma_j(t)}{1+\delta_j L_j(t)}\,dt + L_n(t)\sigma_n(t)^\top dW(t) \tag{3.116}
$$

and consider the case of deterministic $\sigma_i$. The numerators in the drift are
quadratic in the forward LIBOR rates, but they are stabilized by the terms
$1+\delta_j L_j(t)$ in the denominators; indeed, because $L_j(t) \ge 0$ implies

$$
\left|\frac{\delta_j L_j(t)}{1+\delta_j L_j(t)}\right| \le 1,
$$

the drift is linearly bounded in $L_n(t)$, making deterministic $\sigma_i$ admissible.
This feature is lost in the limit as the compounding period $\delta_i$ decreases to
zero. Thus, the distinction between continuous and simple forward rates turns
out to have important mathematical as well as practical implications.


3.7.2 Pricing Derivatives 


We have noted two important features of LIBOR market models: they are
based on observable market rates, and (in contrast to the HJM framework)
they admit deterministic volatilities $\sigma_j$. A third important and closely related
feature arises in the pricing of interest rate caps.

Recall from Example 3.6.5 (or Appendix C.2) that a cap is a collection
of caplets and that each caplet may be viewed as a call option on a simple
forward rate. Consider, then, a caplet for the accrual period $[T_n,T_{n+1}]$. The
underlying rate is $L_n$ and the value $L_n(T_n)$ is fixed at $T_n$. With a strike of $K$,
the caplet’s payoff is $\delta_n(L_n(T_n) - K)^+$; think of the caplet as refunding the
amount by which interest paid at rate $L_n(T_n)$ exceeds interest paid at rate
$K$. This payoff is made at $T_{n+1}$.

Let $C_n(t)$ denote the price of this caplet at time $t$; we know the terminal
value $C_n(T_{n+1}) = \delta_n(L_n(T_n) - K)^+$ and we want to find the initial value
$C_n(0)$. Under the spot measure, the deflated price $C_n(t)/B^*(t)$ must be a
martingale, so

$$
C_n(0) = B^*(0)\,E^*\left[\frac{\delta_n(L_n(T_n) - K)^+}{B^*(T_{n+1})}\right],
$$

where we have written $E^*$ for expectation under the spot measure. Through
$B^*(T_{n+1})$, this expectation involves the joint distribution of $L_1(T_1),\dots,
L_n(T_n)$, making its value difficult to discern. In contrast, under the forward
measure $P_{n+1}$ associated with maturity $T_{n+1}$, the martingale property applies
to $C_n(t)/B_{n+1}(t)$. We may therefore also write

$$
C_n(0) = B_{n+1}(0)\,E_{n+1}\left[\frac{\delta_n(L_n(T_n) - K)^+}{B_{n+1}(T_{n+1})}\right],
$$

with $E_{n+1}$ denoting expectation under $P_{n+1}$. Conveniently, $B_{n+1}(T_{n+1}) \equiv 1$,
so this expectation depends only on the marginal distribution of $L_n(T_n)$. If
we take $\sigma_n$ to be deterministic, then $L_n(T_n)$ has the lognormal distribution
$LN(-\bar\sigma_n^2(T_n)/2,\ \bar\sigma_n^2(T_n))$, using the notation in (3.115). In this case, the caplet
price is given by the Black formula (after Black [49]),

$$
C_n(0) = BC\bigl(L_n(0),\ \bar\sigma_n(T_n),\ T_n,\ K,\ \delta_n B_{n+1}(0)\bigr),
$$

with

$$
BC(F,\sigma,T,K,b) = b\left(F\,\Phi\!\left(\frac{\log(F/K)+\sigma^2T/2}{\sigma\sqrt{T}}\right) - K\,\Phi\!\left(\frac{\log(F/K)-\sigma^2T/2}{\sigma\sqrt{T}}\right)\right) \tag{3.117}
$$

and $\Phi$ the cumulative normal distribution. Thus, under the assumption of
deterministic volatilities, caplets are priced in closed form by the Black formula.

This formula is frequently used in the reverse direction. Given the market
price of a caplet, one can solve for the “implied volatility” that makes the
formula match the market price. This is useful in calibrating a model to market
data, a point we return to in Section 3.7.4.
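
The Black formula (3.117) and its use in the reverse direction can be sketched in Python using only the standard library. The bisection routine `implied_vol` and its bracketing interval are assumptions chosen for illustration; any one-dimensional root finder would serve, since the price is increasing in the volatility.

```python
import math


def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def black_caplet(F, sigma, T, K, b):
    """BC(F, sigma, T, K, b) of (3.117)."""
    sd = sigma * math.sqrt(T)
    d1 = (math.log(F / K) + 0.5 * sd * sd) / sd
    d2 = d1 - sd
    return b * (F * norm_cdf(d1) - K * norm_cdf(d2))


def implied_vol(price, F, T, K, b, lo=1e-6, hi=5.0, tol=1e-10):
    """Volatility at which the Black price matches `price`, by bisection."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if black_caplet(F, mid, T, K, b) < price:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)


# Caplet on a rate fixed in two years: L_n(0) = 6%, strike 6%, delta_n = 0.25,
# B_{n+1}(0) = 0.95 (illustrative numbers)
b = 0.25 * 0.95
price = black_caplet(0.06, 0.2, 2.0, 0.06, b)
print(price, implied_vol(price, 0.06, 2.0, 0.06, b))
```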

Consider, more generally, a derivative security making a payoff of $g(L(T_n))$
at $T_k$, with $L(T_n) = (L_1(T_1),\dots,L_{n-1}(T_{n-1}),L_n(T_n),\dots,L_M(T_n))$ and $k \ge
n$. The price of the derivative at time 0 is given by

$$
E^*\left[\frac{g(L(T_n))}{B^*(T_k)}\right]
$$

(using the fact that $B^*(0) = 1$), and also by

$$
B_m(0)\,E_m\left[\frac{g(L(T_n))}{B_m(T_k)}\right]
$$

for every $m \ge k$. Which measure and numeraire are most convenient depends
on the payoff function $g$. However, in most cases, the expectation cannot be
evaluated explicitly and simulation is required.

As a further illustration, we consider the pricing of a swaption as described
in Example 3.6.6 and Appendix C.2. Suppose the underlying swap begins at
$T_n$ with fixed- and floating-rate payments exchanged at $T_{n+1},\dots,T_{M+1}$. From
equation (C.7) in Appendix C, we find that the forward swap rate at time $t$
is given by

$$
S_n(t) = \frac{B_n(t) - B_{M+1}(t)}{\sum_{j=n+1}^{M+1}\delta_j B_j(t)}. \tag{3.118}
$$

Using (3.107) and noting that $B_{\eta(t)}(t)$ cancels from the numerator and
denominator, this swap rate can be expressed purely in terms of forward LIBOR
rates.

Consider, now, an option expiring at time $T_k \le T_n$ to enter into the swap
over $[T_n, T_{M+1}]$ with fixed rate $R$. The value of the option at expiration can
be expressed as (cf. equation (C.11))

$$
\sum_{j=n+1}^{M+1}\delta_j B_j(T_k)\,(R - S_n(T_k))^+.
$$

This can be written as a function $g(L(T_k))$ of the LIBOR rates. The price
at time zero can therefore be expressed as an expectation using the general
expressions above.

By applying Itô’s formula to the swap rate (3.118), it is not difficult to 
conclude that if the forward LIBOR rates have deterministic volatilities, then 
the forward swap rate cannot also have a deterministic volatility. In particu- 
lar, then, the forward swap rate cannot be geometric Brownian motion under 
any equivalent measure. Brace et al. [56] nevertheless use a lognormal ap- 
proximation to the swap rate to develop a method for pricing swaptions; their 
approximation appears to give excellent results. An alternative approach has 
been developed by Jamshidian [197]. He develops a model in which the term 
structure is described through a vector $(S_0(t),\dots,S_M(t))$ of forward swap
rates. He shows that one may choose the volatilities of the forward swap rates 
to be deterministic, and that in this case swaption prices are given by a variant 
of the Black formula. However, in this model, the LIBOR rates cannot also 
have deterministic volatilities, so caplets are no longer priced by the Black 
formula. One must therefore choose between the two pricing formulas. 


3.7.3 Simulation 


Pricing derivative securities in LIBOR market models typically requires sim- 
ulation. As in the HJM setting, exact simulation is generally infeasible and 
some discretization error is inevitable. Because the models of this section deal 
with a finite set of maturities from the outset, we need only discretize the 
time argument, whereas in the HJM setting both time and maturity required 
discretization. 

We fix a time grid $0 = t_0 < t_1 < \cdots < t_m < t_{m+1}$ over which to simulate. It
is sensible to include the tenor dates $T_1,\dots,T_{M+1}$ among the simulation dates.
In practice, one would often even take $t_i = T_i$ so that the simulation evolves
directly from one tenor date to the next. We do not impose any restrictions
on the volatilities $\sigma_n$, though the deterministic case is the most widely used.
The only other specific case that has received much attention takes $\sigma_n(t)$ to
be the product of a deterministic function of time and a function of $L(t)$ as
proposed in Andersen and Andreasen [13]. For example, one may take $\sigma_n(t)$
proportional to a power of $L_n(t)$, resulting in a CEV-type of volatility. In either
this extension or in the case of deterministic volatilities, it often suffices to
restrict the dependence on time to piecewise constant functions that change
values only at the $T_i$. We return to this point in Section 3.7.4.

Simulation of forward LIBOR rates is a special case of the general prob- 
lem of simulating a system of SDEs. One could apply an Euler scheme or a 
higher-order method of the type discussed in Chapter 6. However, even if we 
restrict ourselves to Euler schemes (as we do here), there are countless alter- 
natives. We have many choices of variables to discretize and many choices of 
probability measure under which to simulate. Several strategies are compared 
both theoretically and numerically in Glasserman and Zhao [151], and the 
discussion here draws on that investigation. 

The most immediate application of the Euler scheme under the spot
measure discretizes the SDE (3.116), producing

$$
\hat L_n(t_{i+1}) = \hat L_n(t_i) + \mu_n(\hat L(t_i),t_i)\hat L_n(t_i)[t_{i+1}-t_i] + \hat L_n(t_i)\sqrt{t_{i+1}-t_i}\,\sigma_n(t_i)^\top Z_{i+1} \tag{3.119}
$$

with

$$
\mu_n(\hat L(t_i),t_i) = \sum_{j=\eta(t_i)}^{n}\frac{\delta_j\hat L_j(t_i)\,\sigma_n(t_i)^\top\sigma_j(t_i)}{1+\delta_j\hat L_j(t_i)}
$$

and $Z_1, Z_2, \dots$ independent $N(0,I)$ random vectors in $\mathbb{R}^d$. Here, as in
Section 3.6.2, we use hats to identify discretized variables. We assume that we
are given an initial set of bond prices $B_1(0),\dots,B_{M+1}(0)$ and initialize the
simulation by setting

$$
\hat L_n(0) = \frac{B_n(0) - B_{n+1}(0)}{\delta_n B_{n+1}(0)}, \quad n = 1,\dots,M,
$$

in accordance with (3.105).

An alternative to (3.119) approximates the LIBOR rates under the spot
measure using

$$
\hat L_n(t_{i+1}) = \hat L_n(t_i)\exp\left(\left[\mu_n(\hat L(t_i),t_i) - \tfrac{1}{2}\|\sigma_n(t_i)\|^2\right][t_{i+1}-t_i] + \sqrt{t_{i+1}-t_i}\,\sigma_n(t_i)^\top Z_{i+1}\right). \tag{3.120}
$$

This is equivalent to applying an Euler scheme to $\log L_n$; it may also be viewed
as approximating $L_n$ by geometric Brownian motion over $[t_i,t_{i+1}]$, with drift
and volatility parameters fixed at $t_i$. This method seems particularly attractive
in the case of deterministic $\sigma_n$, since then $L_n$ is close to lognormal. A further
property of (3.120) is that it keeps all $\hat L_n$ positive, whereas (3.119) can produce
negative rates.
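
A single time step of (3.119) or (3.120) under the spot measure might be sketched as follows. The function name `step_spot`, the flag `log_euler`, and the list layout (index $n = 0,\dots,M$, with rates below $\eta$ left untouched) are assumptions for illustration. The drift sum is accumulated across $n$ so that each term is computed once.

```python
import math


def step_spot(L, delta, sigma, eta, dt, z, log_euler=True):
    """Advance L_eta..L_M over one step of length dt under the spot measure.

    L[n], delta[n], sigma[n] are indexed by n = 0..M; sigma[n] is the
    volatility vector of L_n at the start of the step. z holds d independent
    N(0,1) draws. Rates with n < eta have already been fixed and are copied.
    """
    acc = [0.0] * len(z)    # running sum_j delta_j L_j sigma_j / (1 + delta_j L_j)
    new = list(L)
    for n in range(eta, len(L)):
        w = delta[n] * L[n] / (1.0 + delta[n] * L[n])
        acc = [a + w * s for a, s in zip(acc, sigma[n])]
        mu = sum(s * a for s, a in zip(sigma[n], acc))          # drift of (3.111)
        vol2 = sum(s * s for s in sigma[n])
        shock = math.sqrt(dt) * sum(s * zk for s, zk in zip(sigma[n], z))
        if log_euler:
            new[n] = L[n] * math.exp((mu - 0.5 * vol2) * dt + shock)   # (3.120)
        else:
            new[n] = L[n] * (1.0 + mu * dt + shock)                    # (3.119)
    return new


L = [0.05, 0.05, 0.05]
delta = [0.5, 0.5, 0.5]
sigma = [[3.0]] * 3         # deliberately extreme volatility
print(step_spot(L, delta, sigma, 1, 1.0, [-1.0], log_euler=False))  # negative rates
print(step_spot(L, delta, sigma, 1, 1.0, [-1.0]))                   # stays positive
```

The drift uses only the rates at the start of the step, in line with fixing the drift and volatility parameters at $t_i$.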

For both of these algorithms it is important to note that our definition
of $\eta$ makes $\eta$ right-continuous. For the original continuous-time processes we
could just as well have taken $\eta$ to be left-continuous, but the distinction is
important in the discrete approximation. If $t_i = T_k$, then $\eta(t_i) = k+1$ and the
sum in each $\mu_n(\hat L(t_i),t_i)$ starts at $k+1$. Had we taken $\eta$ to be left-continuous,
we would have $\eta(T_k) = k$ and thus an additional term in each $\mu_n$. It seems
intuitively more natural to omit this term as time advances beyond $T_k$ since
$L_k$ ceases to be meaningful after $T_k$. Glasserman and Zhao [151] and Sidenius
[330] both find that omitting it (i.e., taking $\eta$ right-continuous) results in
smaller discretization error.

Both (3.119) and (3.120) have obvious counterparts for simulation under
the forward measure $P_{M+1}$. The only modification necessary is to replace
$\mu_n(\hat L(t_i),t_i)$ with

$$
\mu_n(\hat L(t_i),t_i) = -\sum_{j=n+1}^{M}\frac{\delta_j\hat L_j(t_i)\,\sigma_n(t_i)^\top\sigma_j(t_i)}{1+\delta_j\hat L_j(t_i)}.
$$

Notice that $\mu_M \equiv 0$. It follows that if $\sigma_M$ is deterministic and constant
between the $t_i$ (for example, constant between tenor dates), then the log Euler
scheme (3.120) with $\mu_M = 0$ simulates $L_M$ without discretization error under
the forward measure $P_{M+1}$. None of the $L_n$ is simulated without
discretization error under the spot measure, but we will see that the spot measure is
nevertheless generally preferable for simulation.
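
The forward-measure drifts can be computed for all $n$ in one backward pass, as in the following sketch. The function name and list layout (index $n = 0,\dots,M$) are assumptions for the example.

```python
def forward_drifts(L, delta, sigma):
    """Drifts mu_n under the forward measure P_{M+1}, for n = 0..M.

    mu_n = -sum_{j=n+1}^{M} delta_j L_j sigma_n' sigma_j / (1 + delta_j L_j),
    accumulated backwards from n = M so each term is computed once.
    """
    M = len(L) - 1
    acc = [0.0] * len(sigma[0])     # running sum over j > n of w_j * sigma_j
    mu = [0.0] * (M + 1)            # mu_M stays exactly zero
    for n in range(M - 1, -1, -1):
        w = delta[n + 1] * L[n + 1] / (1.0 + delta[n + 1] * L[n + 1])
        acc = [a + w * s for a, s in zip(acc, sigma[n + 1])]
        mu[n] = -sum(s * a for s, a in zip(sigma[n], acc))
    return mu


print(forward_drifts([0.04, 0.05, 0.06], [0.5] * 3, [[0.2]] * 3))
```

Substituting these values for the spot-measure drift in (3.119) or (3.120) gives the corresponding schemes under $P_{M+1}$; the last entry is zero, which is why the log Euler scheme is exact for $L_M$ when $\sigma_M$ is piecewise constant.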


Martingale Discretization 


In our discussion of simulation in the HJM setting, we devoted substantial at- 
tention to the issue of choosing the discrete drift to keep the model arbitrage- 
free even after discretization. It is therefore natural to examine whether an 
analogous choice of drift can be made in the LIBOR rate dynamics. In the 
HJM setting, we derived the discrete drift from the condition that the dis- 
cretized discounted bond prices must be martingales. In the LIBOR market 
model, the corresponding requirement is that 


m—1 

š 1 

Dalt) = ]] —— (3.121) 
j=0 1+ ôj Lj (ti A T;) 


be a martingale (in i) for each n under the spot measure; see (3.109). Under 
the forward measure, the martingale condition applies to 


M 
D,(t) = [I € + 5;L5(ti) , (3.122) 


see (3.113). 
Consider the spot measure first. We would like, as a special case of (3.121), 


for 1/(1 + 6,£1) to be a martingale. Using the Euler scheme (3.119), this 
requires 
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1 1 
Ee EA, E EEE 
EE + pity EE 1 + ôı Lı (0) 


the expectation taken with respect to $Z_1 \sim N(0, I)$. However, because the
denominator inside the expectation has a normal distribution, the expectation
is infinite no matter how we choose $\mu_1$. There is no discrete drift that
preserves the martingale property. If, instead, we use the method in (3.120),
the condition becomes

$$E\left[\frac{1}{1 + \delta_1 L_1(0)\exp\left(\left[\mu_1 - \|\sigma_1\|^2/2\right] t_1 + \sqrt{t_1}\,\sigma_1^\top Z_1\right)}\right] = \frac{1}{1 + \delta_1 L_1(0)}.$$


In this case, there is a value of $\mu_1$ for which this equation holds, but there
is no explicit expression for it. The root of the difficulty lies in evaluating an
expression of the form

$$E\left[\frac{1}{1 + \exp(a + bZ)}\right], \quad Z \sim N(0,1),$$

which is effectively intractable. In the HJM setting, calculation of the discrete
drift relies on evaluating far more convenient expressions of the form
$E[\exp(a + bZ)]$; see the steps leading to (3.97).
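Although the martingale condition for the log Euler scheme has no explicit solution, the drift can be found numerically in the simplest case. The following is a minimal sketch, assuming a single scalar factor, a single step of length t1, and illustrative parameter values that are not taken from the text; the function names are hypothetical. It evaluates the expectation by the trapezoid rule and solves for the drift by bisection.

```python
import math


def log_euler_expectation(mu, L0, delta, sigma, t1, n_grid=4001, z_max=10.0):
    """Trapezoid-rule value of E[1 / (1 + delta * L1)], where
    L1 = L0 * exp((mu - sigma^2 / 2) * t1 + sigma * sqrt(t1) * Z) and Z ~ N(0, 1)."""
    h = 2.0 * z_max / (n_grid - 1)
    total = 0.0
    for k in range(n_grid):
        z = -z_max + k * h
        weight = 0.5 if k in (0, n_grid - 1) else 1.0
        density = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
        rate = L0 * math.exp((mu - 0.5 * sigma ** 2) * t1 + sigma * math.sqrt(t1) * z)
        total += weight * density / (1.0 + delta * rate)
    return total * h


def solve_discrete_drift(L0, delta, sigma, t1, lo=-1.0, hi=1.0, iterations=60):
    """Bisection for the mu that makes 1 / (1 + delta * L1) a one-step martingale.
    The expectation is decreasing in mu, so the bracket [lo, hi] shrinks onto the root."""
    target = 1.0 / (1.0 + delta * L0)
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if log_euler_expectation(mid, L0, delta, sigma, t1) > target:
            lo = mid      # expectation still too large: the drift must increase
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    # Illustrative inputs only: 6% initial rate, semi-annual accrual, 20% volatility.
    mu1 = solve_discrete_drift(L0=0.06, delta=0.5, sigma=0.2, t1=0.5)
    print("discrete drift mu_1 =", mu1)
```

A numerical search of this kind is workable for one rate and one step, but it would have to be repeated for every rate at every step along every path, which is why the adjustment is impractical in general.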

Under the forward measure, it is feasible to choose $\mu_1$ so that $\hat D_2$ in (3.122)
is a martingale using an Euler scheme for either $\hat L_1$ or $\log \hat L_1$. However, this
quickly becomes cumbersome for $\hat D_n$ with larger values of n. As a practical
matter, it does not seem feasible under any of these methods to adjust the
drift to make the deflated bond prices martingales. A consequence of this is
that if we price bonds in the simulation by averaging replications of (3.121)
or (3.122), the simulation price will not converge to the corresponding $B_n(0)$
as the number of replications increases.

An alternative strategy is to discretize and simulate the deflated bond 
prices themselves, rather than the forward LIBOR rates. For example, under 
the spot measure, the deflated bond prices satisfy 


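$$\frac{dD_{n+1}(t)}{D_{n+1}(t)} = -\sum_{j=\eta(t)}^{n} \frac{\delta_j L_j(t)}{1 + \delta_j L_j(t)}\,\sigma_j^\top(t)\,dW(t) = \sum_{j=\eta(t)}^{n} \left(\frac{D_{j+1}(t)}{D_j(t)} - 1\right)\sigma_j^\top(t)\,dW(t). \tag{3.123}$$

An Euler scheme for $\log D_{n+1}$ therefore evolves according to

$$\hat D_{n+1}(t_{i+1}) = \hat D_{n+1}(t_i)\exp\left(-\tfrac{1}{2}\|\hat\nu_{n+1}(t_i)\|^2 [t_{i+1} - t_i] + \sqrt{t_{i+1} - t_i}\;\hat\nu_{n+1}(t_i)^\top Z_{i+1}\right) \tag{3.124}$$

with

$$\hat\nu_{n+1}(t_i) = \sum_{j=\eta(t_i)}^{n} \left(\frac{\hat D_{j+1}(t_i)}{\hat D_j(t_i)} - 1\right)\sigma_j(t_i). \tag{3.125}$$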


In either case, the discretized deflated bond prices are automatically martingales;
in (3.124) they are positive martingales and in this sense the discretization
is arbitrage-free. From the simulated $\hat D_n(t_i)$ we can then define
the discretized forward LIBOR rates by setting

$$\hat L_n(t_i) = \frac{1}{\delta_n}\left(\frac{\hat D_n(t_i) - \hat D_{n+1}(t_i)}{\hat D_{n+1}(t_i)}\right)$$

for n = 1,...,M. Any other term structure variables (e.g., swap rates) required
in the simulation can then be defined from the $\hat L_n$.
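The log Euler recursion for deflated bond prices under the spot measure can be sketched as follows. This is an illustration under simplifying assumptions that are not in the text: a flat initial curve, equal accrual periods, one scalar constant volatility, and time steps that coincide with the tenor dates (so that $\eta(T_i) = i + 1$). The function names are hypothetical.

```python
import math
import random


def simulate_deflated_bonds(L0, delta, sigma, M, rng):
    """One path of the log Euler scheme for deflated bond prices (spot measure).

    Returns a list D where D[k] is the deflated price of the bond maturing at T_k,
    frozen at its value at T_{k-1}; D[M + 1] is the value at T_M.
    """
    # Initial deflated prices equal initial bond prices: B_k(0) = (1 + delta * L0)^(-k).
    D = [(1.0 + delta * L0) ** (-k) for k in range(M + 2)]
    sqrt_dt = math.sqrt(delta)
    for i in range(M):                       # step from T_i to T_{i+1}
        z = rng.gauss(0.0, 1.0)              # one normal draw per step (single factor)
        old = D[:]                           # bond prices at the start of the step
        nu = 0.0
        for n in range(i + 1, M + 1):        # n = eta(t_i), ..., M updates bond n + 1
            nu += (old[n + 1] / old[n] - 1.0) * sigma   # running sum defining nu_{n+1}
            D[n + 1] = old[n + 1] * math.exp(-0.5 * nu * nu * delta + sqrt_dt * nu * z)
    return D


def forward_rates_at_reset(D, delta):
    """Recover L_n(T_n), n = 1..M, from the frozen deflated bond prices."""
    M = len(D) - 2
    return [(D[n] - D[n + 1]) / (delta * D[n + 1]) for n in range(1, M + 1)]


if __name__ == "__main__":
    path = simulate_deflated_bonds(L0=0.06, delta=0.5, sigma=0.2, M=4, rng=random.Random(1))
    print(forward_rates_at_reset(path, 0.5))
```

Because each step multiplies a bond price by an exponential martingale factor, the sample average of a simulated deflated bond price should agree with its initial value up to Monte Carlo error, with no discretization bias.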
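Glasserman and Zhao [151] recommend replacing

$$\left(\frac{\hat D_{j+1}(t_i)}{\hat D_j(t_i)} - 1\right) \quad\text{with}\quad \left(\min\left\{\left(\frac{\hat D_{j+1}(t_i)}{\hat D_j(t_i)}\right)^{+}, 1\right\} - 1\right). \tag{3.126}$$

This modification has no effect in the continuous-time limit because
$0 < D_{j+1}(t) < D_j(t)$ (if $L_j(t) > 0$). But in the discretized process the ratio
$\hat D_{j+1}/\hat D_j$ could potentially exceed 1.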
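Under the forward measure $P_{M+1}$, the deflated bond prices (3.113) satisfy

$$\frac{dD_{n+1}(t)}{D_{n+1}(t)} = \sum_{j=n+1}^{M} \frac{\delta_j L_j(t)}{1 + \delta_j L_j(t)}\,\sigma_j^\top(t)\,dW^{M+1}(t) = \sum_{j=n+1}^{M} \left(1 - \frac{D_{j+1}(t)}{D_j(t)}\right)\sigma_j^\top(t)\,dW^{M+1}(t). \tag{3.127}$$

We can again apply an Euler discretization to the logarithm of these variables
to get (3.124), except that now

$$\hat\nu_{n+1}(t_i) = \sum_{j=n+1}^{M} \left(1 - \frac{\hat D_{j+1}(t_i)}{\hat D_j(t_i)}\right)\sigma_j(t_i),$$

possibly modified as in (3.126).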
Glasserman and Zhao [151] consider several other choices of variables for
discretization, including (under the spot measure) the normalized differences

$$V_n(t) = \frac{D_n(t) - D_{n+1}(t)}{B_1(0)}, \quad n = 1, \ldots, M;$$

these are martingales because the deflated bond prices are martingales. They
satisfy

$$\frac{dV_n}{V_n} = \left[\left(\frac{V_n + V_{n-1} + \cdots + V_1 - 1}{V_{n-1} + \cdots + V_1 - 1}\right)\sigma_n^\top + \sum_{j=\eta}^{n-1} \left(\frac{V_j}{V_{j-1} + \cdots + V_1 - 1}\right)\sigma_j^\top\right] dW,$$

with the convention $\sigma_{M+1} \equiv 0$. Forward rates are recovered using

$$\delta_n L_n(t) = \frac{V_n(t)}{V_{n+1}(t) + \cdots + V_{M+1}(t)}.$$


Similarly, the variables

$$\delta_n X_n(t) = \delta_n L_n(t) \prod_{j=n+1}^{M} \left(1 + \delta_j L_j(t)\right)$$

are differences of deflated bond prices under the forward measure $P_{M+1}$ and
thus martingales under that measure. The $X_n$ satisfy

$$\frac{dX_n}{X_n} = \left(\sigma_n^\top + \sum_{j=n+1}^{M} \frac{\delta_j X_j\,\sigma_j^\top}{1 + \delta_j X_j + \cdots + \delta_M X_M}\right) dW^{M+1}.$$

Forward rates are recovered using

$$L_n = \frac{X_n}{1 + \delta_{n+1} X_{n+1} + \cdots + \delta_M X_M}.$$

Euler discretizations of $\log V_n$ and $\log X_n$ preserve the martingale property
and thus keep the discretized model arbitrage-free.


Pricing Derivatives 


The pricing of a derivative security in a simulation proceeds as follows. Using
any of the methods considered above, we simulate paths of the discretized
variables $\hat L_1, \ldots, \hat L_M$. Suppose we want to price a derivative with a payoff of
$g(L(T_n))$ at time $T_n$. Under the spot measure, we simulate to time $T_n$ and
then calculate the deflated payoff

$$g(\hat L(T_n)) \cdot \prod_{j=0}^{n-1} \frac{1}{1 + \delta_j \hat L_j(T_j)}.$$

Averaging over independent replications produces an estimate of the derivative's
price at time 0. If we simulate under the forward measure, the estimate
consists of independent replications of

$$g(\hat L(T_n)) \cdot B_{M+1}(0) \cdot \prod_{j=n}^{M} \left(1 + \delta_j \hat L_j(T_n)\right).$$


Glasserman and Zhao [151] compare various simulation methods based, in
part, on their discretization error in pricing caplets. For the case of a caplet
over $[T_{n-1}, T_n]$, take $g(x) = \delta_{n-1}(x_{n-1} - K)^+$ in the expressions above. If the $\sigma_j$
are deterministic, the caplet price is given by the Black formula, as explained
in Section 3.7.2. However, because of the discretization error, the simulation
price will not in general converge exactly to the Black price as the number
of replications increases. The bias in pricing caplets serves as a convenient
indication of the magnitude of the discretization error.

Figure 3.19, reproduced from Glasserman and Zhao [151], graphs biases 
in caplet pricing as a function of caplet maturity for various simulation meth- 
ods. The horizontal line through the center of each panel corresponds to zero 
bias. The error bars around each curve have halfwidths of one standard error, 
indicating that the apparent biases are statistically significant. Details of the 
parameters used for these experiments are reported in Glasserman and Zhao 
[151] along with several other examples. 

These and other experiments suggest the following observations. The
smallest biases are achieved by simulating the differences of deflated bond
prices (the $V_n$ in the spot measure and the $X_n$ in the forward measure) using
an Euler scheme for the logarithms of these variables. (See Glasserman and
Zhao [151] for an explanation of the modified $V_n$ method.) An Euler scheme
for $\log D_n$ is nearly indistinguishable from an Euler scheme for $L_n$. Under the
forward measure $P_{M+1}$, the final caplet is priced without discretization error
by the Euler schemes for $\log X_n$ and $\log L_n$; these share the feature that they
make the discretized rate $\hat L_M$ lognormal.

The graphs in Figure 3.19 compare discretization biases but say noth- 
ing about the relative variances of the methods. Glasserman and Zhao [151] 
find that simulating under the spot measure usually results in smaller vari- 
ance than simulating under the forward measure, especially at high levels of 
volatility. An explanation for this is suggested by the expressions (3.109) and 
(3.113) for the deflated bond prices under the two measures: whereas (3.109) 
always lies between 0 and 1, (3.113) can take arbitrarily large values. This 
affects derivatives pricing through the discounting of payoffs.


3.7.4 Volatility Structure and Calibration 


In our discussion of LIBOR market models we have taken the volatility factors
$\sigma_n(t)$ as inputs without indicating how they might be specified. In practice,
these coefficients are chosen to calibrate a model to market prices of actively
traded derivatives, especially caps and swaptions. (The model is automatically
calibrated to bond prices through the relations (3.105) and (3.106).) Once the
model has been calibrated to the market, it can be used to price less liquid
instruments for which market prices may not be readily available. Accurate
and efficient calibration is a major topic in its own right and we can only touch
on the key issues. For a more extensive treatment, see James and Webber [194]
and Rebonato [303]. Similar considerations apply in both the HJM framework
and in LIBOR market models; we discuss calibration in the LIBOR setting
because it is somewhat simpler. Indeed, convenience in calibration is one of
the main advantages of this class of models.

Fig. 3.19. Comparison of biases in caplet pricing for various simulation methods.
Top panel uses spot measure; method A is an Euler scheme for $L_n$ and methods
B-E are Euler schemes for log variables. Bottom panel uses the forward measure
$P_{M+1}$; method A is an Euler scheme for $L_n$ and methods B-D are Euler schemes
for log variables.

The variables $\sigma_n(t)$ are the primary determinants of both the level of
volatility in forward rates and the correlations between forward rates. It is
often useful to distinguish these two aspects and we will consider the overall
level of volatility first. Suppose we are given the market price of a caplet for
the interval $[T_n, T_{n+1}]$ and from this price we calculate an implied volatility
$v_n$ by inverting the Black formula (3.117). (We can assume that the other
parameters of the formula are known.) If we choose $\sigma_n$ to be any deterministic
$\mathbb{R}^d$-valued function satisfying

$$\frac{1}{T_n}\int_0^{T_n} \|\sigma_n(t)\|^2\,dt = v_n^2,$$


then we know from the discussion in Section 3.7.2 that the model is calibrated 
to the market price of this caplet, because the model’s caplet price is given 
by the Black formula with implied volatility equal to the square root of the 
expression on the left. By imposing this constraint on all of the $\sigma_j$, we ensure
that the model is calibrated to all caplet prices. (As a practical matter, it 
may be necessary to infer the prices of individual caplets from the prices of 
caps, which are portfolios of caplets. For simplicity, we assume caplet prices 
are available.) 

Because LIBOR market models do not specify interest rates over accrual
periods shorter than the intervals $[T_i, T_{i+1}]$, it is natural and customary to
restrict attention to functions $\sigma_n(t)$ that are constant between tenor dates.
We take each $\sigma_n$ to be right-continuous and thus denote by $\sigma_n(T_i)$ its value
over the interval $[T_i, T_{i+1})$. Suppose, for a moment, that the model is driven
by a scalar Brownian motion, so d = 1 and each $\sigma_n$ is scalar valued. In this
case, it is convenient to think of the volatility structure as specified through a
lower-triangular matrix of the following form:

$$\begin{pmatrix} \sigma_1(T_0) & & & \\ \sigma_2(T_0) & \sigma_2(T_1) & & \\ \vdots & \vdots & \ddots & \\ \sigma_M(T_0) & \sigma_M(T_1) & \cdots & \sigma_M(T_{M-1}) \end{pmatrix}$$

The upper half of the matrix is empty (or irrelevant) because each $L_n(t)$ ceases
to be meaningful for $t > T_n$. In this setting, we have

$$\int_0^{T_n} \sigma_n^2(t)\,dt = \sigma_n^2(T_0)\delta_0 + \sigma_n^2(T_1)\delta_1 + \cdots + \sigma_n^2(T_{n-1})\delta_{n-1},$$

so caplet prices impose a constraint on the sums of squares along each row of
the matrix.

The volatility structure is stationary if $\sigma_n(t)$ depends on n and t only
through the difference $T_n - t$. For a stationary, single-factor, piecewise constant
volatility structure, the matrix above takes the form

$$\begin{pmatrix} \sigma(1) & & & \\ \sigma(2) & \sigma(1) & & \\ \vdots & \vdots & \ddots & \\ \sigma(M) & \sigma(M-1) & \cdots & \sigma(1) \end{pmatrix}$$

for some values $\sigma(1), \ldots, \sigma(M)$. (Think of $\sigma(i)$ as the volatility of a forward
rate i periods away from maturity.) In this case, the number of variables just
equals the number of caplet maturities to which the model may be calibrated.
Calibrating to additional instruments requires introducing nonstationarity or
additional factors.
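Because the stationary matrix is lower triangular with one new unknown per row, the row constraints can be solved one at a time. The sketch below assumes the caplet implied volatilities are already available and uses the row constraint in the form $\sum_{i=0}^{n-1} \sigma(n-i)^2 \delta_i = v_n^2 T_n$; the function names and the sample inputs are hypothetical.

```python
import math


def bootstrap_stationary_vols(implied_vols, deltas):
    """Solve row by row for sigma(1), ..., sigma(M).

    implied_vols[n - 1] is v_n for the caplet on [T_n, T_{n+1}], n = 1..M;
    deltas[i] is T_{i+1} - T_i for i = 0..M-1, with T_0 = 0.
    """
    sigma_sq = []                  # sigma_sq[k - 1] holds sigma(k)^2
    t_n = 0.0
    for n in range(1, len(implied_vols) + 1):
        t_n += deltas[n - 1]                                   # T_n
        total = implied_vols[n - 1] ** 2 * t_n                 # required row sum
        known = sum(sigma_sq[n - i - 1] * deltas[i] for i in range(1, n))
        new_sq = (total - known) / deltas[0]                   # sigma(n)^2 multiplies delta_0
        if new_sq < 0.0:
            raise ValueError("implied volatilities are inconsistent with a stationary structure")
        sigma_sq.append(new_sq)
    return [math.sqrt(s) for s in sigma_sq]


def model_implied_vol(sigmas, deltas, n):
    """Implied volatility v_n reproduced by the stationary structure for row n."""
    t_n = sum(deltas[:n])
    row_sum = sum(sigmas[n - i - 1] ** 2 * deltas[i] for i in range(n))
    return math.sqrt(row_sum / t_n)


if __name__ == "__main__":
    vols = [0.20, 0.22, 0.21, 0.205]        # made-up caplet implied volatilities
    print(bootstrap_stationary_vols(vols, [0.5] * 4))
```

The error branch reflects a genuine limitation: if the cumulative variances $v_n^2 T_n$ do not increase fast enough, no stationary single-factor structure matches the caplet prices.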

In a multifactor model (i.e., d ≥ 2) we can think of replacing the entries
$\sigma_n(T_i)$ in the volatility matrix with the norms $\|\sigma_n(T_i)\|$, since the $\sigma_n(T_i)$ are
now vectors. With piecewise constant values, this gives

$$\int_0^{T_n} \|\sigma_n(t)\|^2\,dt = \|\sigma_n(T_0)\|^2\delta_0 + \|\sigma_n(T_1)\|^2\delta_1 + \cdots + \|\sigma_n(T_{n-1})\|^2\delta_{n-1},$$

so caplet implied volatilities continue to constrain the sums of squares along
each row. This also indicates that taking d ≥ 2 does not provide additional
flexibility in matching these implied volatilities.

The potential value of a multifactor model lies in capturing correlations
between forward rates of different maturities. For example, from the Euler
approximation in (3.120), we see that over a short time interval the correlation
between the increments of $\log L_j(t)$ and $\log L_k(t)$ is approximately

$$\frac{\sigma_k(t)^\top \sigma_j(t)}{\|\sigma_k(t)\|\,\|\sigma_j(t)\|}.$$

These correlations are often chosen to match market prices of swaptions
(which, unlike caps, are sensitive to rate correlations) or to match historical
correlations.

In the stationary case, we can visualize the volatility factors by graphing 
them as functions of time to maturity. This can be useful in interpreting the 
correlations they induce. Figure 3.20 illustrates three hypothetical factors in 
a model with M = 15. Because the volatility is assumed stationary, we may
write $\sigma_n(T_i) = \sigma(n - i)$ for some vectors $\sigma(1), \ldots, \sigma(M)$. In a three-factor
model, each $\sigma(i)$ has three components. The three curves in Figure 3.20 are
graphs of the three components as functions of time to maturity. If we fix a 
time to maturity on the horizontal axis, the total volatility at that point is 
given by the sums of squares of the three components; the inner products of 
these three-dimensional vectors at different times determine the correlations 
between the forward rates. 

Notice that the first factor in Figure 3.20 has the same sign for all matu- 
rities; regardless of the sign of the increment of the driving Brownian motion, 
this factor moves all forward rates in the same direction and functions approx- 
imately as a parallel shift. The second factor has values of opposite signs at 
short and long maturities and will thus have the effect of tilting the forward 
curve (up if the increment in the second component of the driving Brownian 
motion is positive and down if it is negative). The third factor bends the for- 
ward curve by moving intermediate maturities in the opposite direction of long 
and short maturities, the direction depending on the sign of the increment of 
the third component of the driving Brownian motion.

Fig. 3.20. Hypothetical volatility factors.


The hypothetical factors in Figure 3.20 are the first three principal com- 
ponents of the matrix 


$$0.12^2 \exp\left(-0.8\sqrt{|i - j|}\right), \quad i, j = 1, \ldots, 15.$$


More precisely, they are the first three eigenvectors of this matrix as ranked by 
their eigenvalues, scaled to have length equal to their eigenvalues. It is common 
in practice to use the principal components of either the covariance matrix 
or the correlation matrix of changes in forward rates in choosing a factor 
structure. Principal components analysis typically produces the qualitative 
features of the hypothetical example in Figure 3.20; see, e.g., the examples in 
James and Webber [194] or Rebonato [304]. 
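The factor construction described above can be reproduced with a few lines of linear algebra. The sketch below builds the matrix with entries $0.12^2 \exp(-0.8\sqrt{|i-j|})$ and extracts its three leading eigenpairs by power iteration with deflation; that numerical method is chosen here only to keep the example free of external libraries and is not part of the text. The eigenvectors are then scaled to have length equal to their eigenvalues, as stated above. All function names are hypothetical.

```python
import math
import random


def covariance_matrix(m=15, vol=0.12, decay=0.8):
    """Matrix with entries vol^2 * exp(-decay * sqrt(|i - j|))."""
    return [[vol ** 2 * math.exp(-decay * math.sqrt(abs(i - j))) for j in range(m)]
            for i in range(m)]


def mat_vec(A, x):
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def leading_eigenpairs(A, k=3, iterations=2000, seed=0):
    """Top-k eigenpairs of a symmetric positive definite matrix, largest eigenvalue first."""
    m = len(A)
    B = [row[:] for row in A]                 # working copy that gets deflated
    rng = random.Random(seed)
    pairs = []
    for _ in range(k):
        x = [rng.random() + 0.1 for _ in range(m)]    # generic (asymmetric) start vector
        for _ in range(iterations):
            y = mat_vec(B, x)
            norm = math.sqrt(sum(v * v for v in y))
            x = [v / norm for v in y]
        lam = sum(xi * yi for xi, yi in zip(x, mat_vec(B, x)))   # Rayleigh quotient
        pairs.append((lam, x))
        for i in range(m):                    # deflate: remove the component just found
            for j in range(m):
                B[i][j] -= lam * x[i] * x[j]
    return pairs


def scaled_factors(pairs):
    """Unit eigenvectors rescaled to have length equal to their eigenvalues."""
    return [[lam * v for v in vec] for lam, vec in pairs]


if __name__ == "__main__":
    factors = scaled_factors(leading_eigenpairs(covariance_matrix()))
    for number, factor in enumerate(factors, start=1):
        print(number, [round(v, 4) for v in factor])
```

Plotting the three rows of `factors` against time to maturity should show the shift, tilt, and bend shapes discussed for Figure 3.20.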

An important feature of LIBOR market models is that a good deal of 
calibration can be accomplished through closed form expressions or effective 
approximations for the prices of caps and swaptions. This makes calibration 
fast. In the absence of formulas or approximations, calibration is an iterative 
procedure requiring repeated simulation at various parameter values until the 
model price matches the market. Because each simulation can be quite time 
consuming, calibration through simulation can be onerous. 


4


Variance Reduction Techniques 


This chapter develops methods for increasing the efficiency of Monte Carlo 
simulation by reducing the variance of simulation estimates. These meth- 
ods draw on two broad strategies for reducing variance: taking advantage of 
tractable features of a model to adjust or correct simulation outputs, and 
reducing the variability in simulation inputs. We discuss control variates, 
antithetic variates, stratified sampling, Latin hypercube sampling, moment 
matching methods, and importance sampling, and we illustrate these meth- 
ods through examples. Two themes run through this chapter: 


o The greatest gains in efficiency from variance reduction techniques result 
from exploiting specific features of a problem, rather than from generic 
applications of generic methods. 

o Reducing simulation error is often at odds with convenient estimation of the 
simulation error itself; in order to supplement a reduced-variance estimator 
with a valid confidence interval, we sometimes need to sacrifice some of the 
potential variance reduction. 


The second point applies, in particular, to methods that introduce dependence 
across replications in the course of reducing variance. 


4.1 Control Variates 


4.1.1 Method and Examples 


The method of control variates is among the most effective and broadly ap- 
plicable techniques for improving the efficiency of Monte Carlo simulation. 
It exploits information about the errors in estimates of known quantities to 
reduce the error in an estimate of an unknown quantity. 

To describe the method, we let $Y_1, \ldots, Y_n$ be outputs from n replications
of a simulation. For example, $Y_i$ could be the discounted payoff of a derivative
security on the ith simulated path. Suppose that the $Y_i$ are independent and
identically distributed and that our objective is to estimate $E[Y_i]$. The usual
estimator is the sample mean $\bar Y = (Y_1 + \cdots + Y_n)/n$. This estimator is unbiased
and converges with probability 1 as $n \to \infty$.

Suppose, now, that on each replication we calculate another output $X_i$
along with $Y_i$. Suppose that the pairs $(X_i, Y_i)$, i = 1,...,n, are i.i.d. and that
the expectation E[X] of the $X_i$ is known. (We use (X,Y) to denote a generic
pair of random variables with the same distribution as each $(X_i, Y_i)$.) Then
for any fixed b we can calculate

$$Y_i(b) = Y_i - b(X_i - E[X])$$

from the ith replication and then compute the sample mean

$$\bar Y(b) = \bar Y - b(\bar X - E[X]) = \frac{1}{n}\sum_{i=1}^{n} \left(Y_i - b(X_i - E[X])\right). \tag{4.1}$$

This is a control variate estimator; the observed error $\bar X - E[X]$ serves as a
control in estimating E[Y].
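As an estimator of E[Y], the control variate estimator (4.1) is unbiased
because

$$E[\bar Y(b)] = E\left[\bar Y - b(\bar X - E[X])\right] = E[\bar Y] = E[Y]$$

and it is consistent because, with probability 1,

$$\lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^{n} Y_i(b) = \lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^{n} \left(Y_i - b(X_i - E[X])\right) = E\left[Y - b(X - E[X])\right] = E[Y].$$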


Each $Y_i(b)$ has variance

$$\mathrm{Var}[Y_i(b)] = \mathrm{Var}\left[Y_i - b(X_i - E[X])\right] = \sigma_Y^2 - 2b\sigma_X\sigma_Y\rho_{XY} + b^2\sigma_X^2 \equiv \sigma^2(b), \tag{4.2}$$

where $\sigma_X^2 = \mathrm{Var}[X]$, $\sigma_Y^2 = \mathrm{Var}[Y]$, and $\rho_{XY}$ is the correlation between X and
Y. The control variate estimator $\bar Y(b)$ has variance $\sigma^2(b)/n$ and the ordinary
sample mean $\bar Y$ (which corresponds to b = 0) has variance $\sigma_Y^2/n$. Hence, the
control variate estimator has smaller variance than the standard estimator if
$b^2\sigma_X < 2b\sigma_Y\rho_{XY}$.

The optimal coefficient b* minimizes the variance (4.2) and is given by

$$b^* = \frac{\sigma_Y}{\sigma_X}\rho_{XY} = \frac{\mathrm{Cov}[X, Y]}{\mathrm{Var}[X]}. \tag{4.3}$$
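For readers who want the intermediate steps, the minimization can be written out explicitly. This is ordinary calculus applied to the quadratic in (4.2), assuming only that $\sigma_X > 0$:

```latex
\begin{align*}
\frac{d}{db}\,\sigma^2(b) &= -2\sigma_X\sigma_Y\rho_{XY} + 2b\,\sigma_X^2 = 0
  \quad\Longrightarrow\quad b^* = \frac{\sigma_Y}{\sigma_X}\,\rho_{XY}, \\
\sigma^2(b^*) &= \sigma_Y^2 - 2\sigma_Y^2\rho_{XY}^2 + \sigma_Y^2\rho_{XY}^2
  = \sigma_Y^2\left(1 - \rho_{XY}^2\right).
\end{align*}
```

Since the second derivative $2\sigma_X^2$ is positive, the stationary point is the minimum.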


Substituting this value in (4.2) and simplifying, we find that the ratio of
the variance of the optimally controlled estimator to that of the uncontrolled
estimator is

$$\frac{\mathrm{Var}[\bar Y - b^*(\bar X - E[X])]}{\mathrm{Var}[\bar Y]} = 1 - \rho_{XY}^2. \tag{4.4}$$

A few observations follow from this expression:


o With the optimal coefficient b*, the effectiveness of a control variate, as
measured by the variance reduction ratio (4.4), is determined by the strength of
the correlation between the quantity of interest Y and the control X. The
sign of the correlation is irrelevant because it is absorbed in b*.

o If the computational effort per replication is roughly the same with and
without a control variate, then (4.4) measures the computational speed-up
resulting from the use of a control. More precisely, the number of replications
of the $Y_i$ required to achieve the same variance as n replications of the
control variate estimator is $n/(1 - \rho_{XY}^2)$.

o The variance reduction factor $1/(1 - \rho_{XY}^2)$ increases very sharply as $|\rho_{XY}|$
approaches 1 and, accordingly, it drops off quickly as $|\rho_{XY}|$ decreases away
from 1. For example, whereas a correlation of 0.95 produces a ten-fold speed-up,
a correlation of 0.90 yields only a five-fold speed-up; at $|\rho_{XY}| = 0.70$
the speed-up drops to about a factor of two. This suggests that a rather
high degree of correlation is needed for a control variate to yield substantial
benefits.
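The speed-up figures quoted in the last observation follow directly from the factor $1/(1 - \rho_{XY}^2)$. A short check, with a hypothetical function name:

```python
def variance_reduction_factor(rho):
    """Speed-up 1 / (1 - rho^2) from an optimally weighted control variate."""
    if not -1.0 < rho < 1.0:
        raise ValueError("correlation must lie strictly between -1 and 1")
    return 1.0 / (1.0 - rho * rho)


if __name__ == "__main__":
    for rho in (0.95, 0.90, 0.70):
        print(f"rho = {rho:.2f}: speed-up factor = {variance_reduction_factor(rho):.2f}")
```

The three factors are approximately 10.26, 5.26, and 1.96, matching the ten-fold, five-fold, and two-fold figures in the text.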


These remarks and equation (4.4) apply if the optimal coefficient b* is
known. In practice, if E[Y] is unknown it is unlikely that $\sigma_Y$ or $\rho_{XY}$ would
be known. However, we may still get most of the benefit of a control variate
using an estimate of b*. For example, replacing the population parameters in
(4.3) with their sample counterparts yields the estimate

$$\hat b_n = \frac{\sum_{i=1}^{n} (X_i - \bar X)(Y_i - \bar Y)}{\sum_{i=1}^{n} (X_i - \bar X)^2}. \tag{4.5}$$

Dividing numerator and denominator by n and applying the strong law of
large numbers shows that $\hat b_n \to b^*$ with probability 1. This suggests using
the estimator $\bar Y(\hat b_n)$, the sample mean of $Y_i(\hat b_n) = Y_i - \hat b_n(X_i - E[X])$,
i = 1,...,n. Replacing b* with $\hat b_n$ introduces some bias; we return to this point
in Section 4.1.3.

The expression in (4.5) is the slope of the least-squares regression line
through the points $(X_i, Y_i)$, i = 1,...,n. The link between control variates
and regression is useful in the statistical analysis of control variate estimators
and also permits a graphical interpretation of the method. Figure 4.1 shows
a hypothetical scatter plot of simulation outputs $(X_i, Y_i)$ and the estimated
regression line for these points, which passes through the point $(\bar X, \bar Y)$. In the
figure, $\bar X < E[X]$, indicating that the n replications have underestimated E[X].
If the $X_i$ and $Y_i$ are positively correlated, this suggests that the simulation
estimate $\bar Y$ likely underestimates E[Y]. This further suggests that we should
adjust the estimator upward. The regression line determines the magnitude of
the adjustment; in particular, $\bar Y(\hat b_n)$ is the value fitted by the regression line
at the point E[X].

Fig. 4.1. Regression interpretation of control variate method. The regression line
through the points $(X_i, Y_i)$ has slope $\hat b_n$ and passes through $(\bar X, \bar Y)$. The control
variate estimator $\bar Y(\hat b_n)$ is the value fitted by the line at E[X]. In the figure, the
sample mean $\bar X$ underestimates E[X] and $\bar Y$ is adjusted upward accordingly.
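The estimator with an estimated coefficient takes only a few lines to compute. The sketch below, with a hypothetical function name and made-up data, computes the regression slope (4.5) and evaluates the fitted line at the known mean E[X].

```python
def control_variate_estimate(xs, ys, mean_x):
    """Return (estimate, b_hat): the control variate estimator with the
    least-squares slope b_hat, i.e. the regression line evaluated at mean_x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired observations")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    if s_xx == 0.0:
        raise ValueError("the control has no variability")
    b_hat = s_xy / s_xx
    return y_bar - b_hat * (x_bar - mean_x), b_hat


if __name__ == "__main__":
    # Made-up outputs with an exact linear relation y = 2x + 3 and E[X] = 5.
    xs = [1.0, 2.0, 4.0, 7.0]
    ys = [2.0 * x + 3.0 for x in xs]
    print(control_variate_estimate(xs, ys, mean_x=5.0))
```

In this artificial example the sample mean of the X values is 3.5, below the known mean of 5, so the estimate is adjusted upward from the sample mean of 10 to 13, which is the value of the line at E[X].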


Examples 


To make the method of control variates more tangible, we now illustrate it 
with several examples. 


Example 4.1.1 Underlying assets. In derivative pricing simulations, underlying
assets provide a virtually universal source of control variates. We know
from Section 1.2.1 that the absence of arbitrage is essentially equivalent to
the requirement that appropriately discounted asset prices be martingales.
Any martingale with a known initial value provides a potential control variate
precisely because its expectation at any future time is its initial value. To
be concrete, suppose we are working in the risk-neutral measure and suppose
the interest rate is a constant r. If S(t) is an asset price, then exp(−rt)S(t) is
a martingale and E[exp(−rT)S(T)] = S(0). Suppose we are pricing an option
on S with discounted payoff Y, some function of {S(t), 0 ≤ t ≤ T}. From
independent replications $S_i$, i = 1,...,n, each a path of S over [0,T], we can
form the control variate estimator

$$\frac{1}{n}\sum_{i=1}^{n} \left(Y_i - b\left[S_i(T) - e^{rT}S(0)\right]\right),$$

or the corresponding estimator with b replaced by $\hat b_n$. If $Y = e^{-rT}(S(T) - K)^+$,
so that we are pricing a standard call option, the correlation between Y and
S(T) and thus the effectiveness of the control variate depends on the strike
K. At K = 0 we would have perfect correlation; for an option that is deep
out-of-the-money (i.e., with large K), the correlation could be quite low. This
is illustrated in Table 4.1 for the case of $S \sim \mathrm{GBM}(r, \sigma^2)$ with parameters
r = 5%, σ = 30%, S(0) = 50, and T = 0.25. This example shows that the
effectiveness of a control variate can vary widely with the parameters of a
problem. □


K              40      45      50      55      60      65      70
$\hat\rho$     0.995   0.968   0.895   0.768   0.604   0.433   0.286
$\hat\rho^2$   0.99    0.94    0.80    0.59    0.36    0.19    0.08

Table 4.1. Estimated correlation $\hat\rho$ between S(T) and $(S(T) - K)^+$ for various
values of K, with S(0) = 50, σ = 30%, r = 5%, and T = 0.25. The third row
measures the fraction of variance in the call option payoff eliminated by using the
underlying asset as a control variate.
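Correlations of the kind reported in Table 4.1 can be estimated with a short simulation. The sketch below uses the table's parameters but its own sample size and seed, both arbitrary choices, so its output should agree with the table only up to sampling error. The function names are hypothetical.

```python
import math
import random


def sample_correlation(xs, ys):
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


def call_control_correlation(strike, s0=50.0, r=0.05, sigma=0.30, T=0.25,
                             n=50000, seed=1):
    """Estimated correlation between S(T) and (S(T) - K)^+ under GBM(r, sigma^2).
    The discount factor is omitted because it does not change the correlation."""
    rng = random.Random(seed)
    drift = (r - 0.5 * sigma ** 2) * T
    vol = sigma * math.sqrt(T)
    terminal = [s0 * math.exp(drift + vol * rng.gauss(0.0, 1.0)) for _ in range(n)]
    payoffs = [max(s - strike, 0.0) for s in terminal]
    return sample_correlation(terminal, payoffs)


if __name__ == "__main__":
    for strike in (40, 45, 50, 55, 60, 65, 70):
        rho = call_control_correlation(strike)
        print(f"K = {strike}: rho = {rho:.3f}, rho^2 = {rho * rho:.2f}")
```

Reusing the same seed for every strike means all strikes are evaluated on the same simulated terminal prices, which makes the decline in correlation with K easier to see.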


Example 4.1.2 Tractable options. Simulation is sometimes used to price 
complex options in a model in which simpler options can be priced in 
closed form. For example, even under Black-Scholes assumptions, some path- 
dependent options require simulation for pricing even though formulas are 
available for simpler options. A tractable option can sometimes provide a 
more effective control than the underlying asset. 

A particularly effective example of this idea was suggested by Kemna and
Vorst [209] for the pricing of Asian options. Accurate pricing of an option on
the arithmetic average

$$\bar S_A = \frac{1}{n}\sum_{i=1}^{n} S(t_i)$$

requires simulation, even if S is geometric Brownian motion. In contrast, calls
and puts on the geometric average

$$\bar S_G = \left(\prod_{i=1}^{n} S(t_i)\right)^{1/n}$$

can be priced in closed form, as explained in Section 3.2.2. Thus, options on
$\bar S_G$ can be used as control variates in pricing options on $\bar S_A$.

Figure 4.2 shows scatter plots of simulated values of $(\bar S_A - K)^+$ against
the terminal value of the underlying asset S(T), a standard call payoff
$(S(T) - K)^+$, and the geometric call payoff $(\bar S_G - K)^+$. The figures are based
on K = 50 and thirteen equally spaced averaging dates; all other parameters
are as in Example 4.1.1. The leftmost panel shows that the weak correlation
between $\bar S_A$ and S(T) is further weakened by applying the call option payoff
to $\bar S_A$, which projects negative values of $\bar S_A - K$ to zero; the resulting
correlation is approximately 0.79. The middle panel shows the effect of applying
the call option payoff to S(T) as well; in this case the correlation increases
to approximately 0.85. The rightmost panel illustrates the extremely strong
relation between the payoffs on the arithmetic and geometric average call options.
The correlation in this case is greater than 0.99. A similar comparison
is made in Broadie and Glasserman [67]. □


[Figure 4.2: scatter-plot panels; recovered panel titles: Underlying asset, Standard call]

Fig. 4.2. Scatter plots of payoff of call option on arithmetic average against the
underlying asset, the payoff of a standard call, and the payoff of a call on the
geometric average.
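The geometric-average control can be sketched end to end as follows. The closed-form price used here relies on the fact that the geometric average of lognormal prices is itself lognormal; it is written out from that fact and is not copied from Section 3.2.2. The parameters follow Example 4.1.1, while the sample size, seed, and function names are arbitrary choices for illustration.

```python
import math
import random


def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def geometric_asian_call(s0, K, r, sigma, T, n):
    """Discounted expected payoff of a call on the geometric average of
    S(t_1), ..., S(t_n), t_i = i * T / n, when S is GBM(r, sigma^2)."""
    times = [T * i / n for i in range(1, n + 1)]
    mean_t = sum(times) / n
    var_w = sum(min(ti, tj) for ti in times for tj in times) / n ** 2
    m = math.log(s0) + (r - 0.5 * sigma ** 2) * mean_t   # mean of log average
    s = sigma * math.sqrt(var_w)                         # std dev of log average
    d2 = (m - math.log(K)) / s
    d1 = d2 + s
    return math.exp(-r * T) * (math.exp(m + 0.5 * s * s) * normal_cdf(d1) - K * normal_cdf(d2))


def asian_payoffs(s0, K, r, sigma, T, n, n_paths, seed=0):
    """Discounted arithmetic and geometric average call payoffs on common paths."""
    rng = random.Random(seed)
    dt = T / n
    drift = (r - 0.5 * sigma ** 2) * dt
    vol = sigma * math.sqrt(dt)
    discount = math.exp(-r * T)
    arithmetic, geometric = [], []
    for _ in range(n_paths):
        log_s, sum_s, sum_log = math.log(s0), 0.0, 0.0
        for _ in range(n):
            log_s += drift + vol * rng.gauss(0.0, 1.0)
            sum_s += math.exp(log_s)
            sum_log += log_s
        arithmetic.append(discount * max(sum_s / n - K, 0.0))
        geometric.append(discount * max(math.exp(sum_log / n) - K, 0.0))
    return arithmetic, geometric


def controlled_estimate(ys, xs, mean_x):
    """Return (estimate, b_hat, rho) using xs as a control with known mean."""
    n = len(ys)
    y_bar, x_bar = sum(ys) / n, sum(xs) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    b_hat = s_xy / s_xx
    return y_bar - b_hat * (x_bar - mean_x), b_hat, s_xy / math.sqrt(s_xx * s_yy)


if __name__ == "__main__":
    ys, xs = asian_payoffs(50.0, 50.0, 0.05, 0.30, 0.25, 13, 20000, seed=7)
    exact = geometric_asian_call(50.0, 50.0, 0.05, 0.30, 0.25, 13)
    print(controlled_estimate(ys, xs, exact))
```

With a correlation above 0.99, the factor $1/(1 - \rho^2)$ implies a variance reduction of more than fifty-fold relative to the uncontrolled estimator.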


Example 4.1.3 Bond prices. In a model with stochastic interest rates, bond 
prices often provide a convenient source of control variates. As emphasized in 
Sections 3.3-3.4 and Sections 3.6-3.7, an important consideration in imple- 
menting an interest rate simulation is ensuring that the simulation correctly 
prices bonds. While this is primarily important for consistent pricing, as a by- 
product it makes bonds available as control variates. Bonds may be viewed as 
the underlying assets of an interest rate model, so in a sense this example is 


a special case of Example 4.1.1. 
In a model of the short rate r(t), a bond maturing at time T has initial
price

$$B(0,T) = E\left[\exp\left(-\int_0^T r(u)\,du\right)\right],$$


the expectation taken with respect to the risk-neutral measure. Since we may 
assume that B(0,T) is known, the quantity inside the expectation provides 
a potential control variate. But even if r is simulated without discretization 
error at dates $t_1, \ldots, t_n = T$, the difference

$$\exp\left(-\frac{T}{n}\sum_{i=1}^{n} r(t_i)\right) - B(0,T)$$


will not ordinarily have mean 0 because of the error in approximating the 
integral. Using this difference in a control variate estimator could therefore 
introduce some bias, though the bias can be made as small as necessary by 
taking a sufficiently fine partition of [0,T]. In our discussion of the Vasicek
model in Section 3.3, we detailed the exact joint simulation of $r(t_i)$ and its
time integral

$$Y(t_i) = \int_0^{t_i} r(u)\,du.$$

This provides a bias-free control variate because

$$E[\exp(-Y(T))] = B(0,T).$$


Similar considerations apply to the forward rate models of Sections 3.6 
and 3.7. In our discussion of the Heath-Jarrow-Morton framework, we devoted 
considerable attention to deriving the appropriate discrete drift condition. 
Using this drift in a simulation produces unbiased bond price estimates and 
thus makes bonds available as control variates. In our discussion of LIBOR 
market models, we noted that discretizing the system of SDEs for the LIBOR 
rates Ln would not produce unbiased bond estimates; in contrast, the methods 
in Section 3.7 based on discretizing deflated bonds or their differences do 
produce unbiased estimates and thus allow the use of bonds as control variates. 

Comparison of this discussion with the one in Section 3.7.3 should make 
clear that the question of whether or not asset prices can be used as control 
variates is closely related to the question of whether a simulated model is
arbitrage-free. □


Example 4.1.4 Tractable dynamics. The examples discussed thus far are all 
based on using one set of prices in a model as control variates for some other 
price in the same model. Another strategy for developing effective control 
variates uses prices in a simpler model. We give two illustrations of this idea. 
Consider, first, the pricing of an option on an asset whose dynamics are
modeled by

$$\frac{dS(t)}{S(t)} = r\,dt + \sigma(t)\,dW(t),$$

where $\sigma(t)$ may be a function of S(t) or may be stochastic and described by a
second SDE. We might simulate S at dates $t_1, \ldots, t_n$ using an approximation
of the form

$$S(t_{i+1}) = S(t_i)\exp\left(\left[r - \tfrac{1}{2}\sigma(t_i)^2\right](t_{i+1} - t_i) + \sigma(t_i)\sqrt{t_{i+1} - t_i}\;Z_{i+1}\right),$$

where the $Z_i$ are independent N(0,1) variables. In a stochastic volatility
model, a second recursion would determine the evolution of $\sigma(t_i)$. Suppose
the option we want to price would be tractable if the underlying asset were
geometric Brownian motion. Then along with S we could simulate

$$\tilde S(t_{i+1}) = \tilde S(t_i)\exp\left(\left[r - \tfrac{1}{2}\tilde\sigma^2\right](t_{i+1} - t_i) + \tilde\sigma\sqrt{t_{i+1} - t_i}\;Z_{i+1}\right)$$

for some constant $\tilde\sigma$, the same sequence $Z_i$, and with initial condition
$\tilde S(0) = S(0)$. If, for example, the option is a standard call with strike K and
expiration $t_n$, we could form a controlled estimator using independent replications of

$$(S(t_n) - K)^+ - b\left((\tilde S(t_n) - K)^+ - E\left[(\tilde S(t_n) - K)^+\right]\right).$$

Except for a discount factor, the expectation on the right is given by the
Black-Scholes formula. For effective variance reduction, the constant $\tilde\sigma$ should
be chosen close to a typical value of $\sigma$.
As a second illustration of this idea, recall that the dynamics of forward
LIBOR under the spot measure in Section 3.7 are given by

$$\frac{dL_n(t)}{L_n(t)} = \sum_{j=\eta(t)}^{n} \frac{\sigma_j(t)^\top \sigma_n(t)\,\delta_j L_j(t)}{1 + \delta_j L_j(t)}\,dt + \sigma_n(t)^\top dW(t), \quad n = 1, \ldots, M. \tag{4.6}$$

Suppose the $\sigma_n$ are deterministic functions of time. Along with the forward
LIBOR rates, we could simulate auxiliary processes

$$\frac{dS_n(t)}{S_n(t)} = \sigma_n(t)^\top dW(t), \quad n = 1, \ldots, M. \tag{4.7}$$

These form a multivariate geometric Brownian motion and lend themselves
to tractable pricing and thus to control variates. Alternatively, we could use

$$\frac{d\hat L_n(t)}{\hat L_n(t)} = \sum_{j=\eta(t)}^{n} \frac{\sigma_j(t)^\top \sigma_n(t)\,\delta_j L_j(0)}{1 + \delta_j L_j(0)}\,dt + \sigma_n(t)^\top dW(t). \tag{4.8}$$

Notice that the drift in this expression is a function of the constants $L_j(0)$
rather than the stochastic processes $L_j(t)$ appearing in the drift in (4.6).
Hence, $\hat L_n$ is also a geometric Brownian motion though with time-varying
drift. Even if an option on the $S_n$ or $\hat L_n$ cannot be valued in closed form, if
it can be valued quickly using a numerical procedure it may yield an effective
control variate.

The evolution of $L_n$, $S_n$, and $\hat L_n$ is illustrated in Figure 4.3. This example
initializes all rates at 6%, takes $\delta_j \equiv 0.5$ (corresponding to semi-annual rates),
and assumes a stationary specification of the volatility functions in which
$\sigma_n(t) \equiv \sigma(n - \eta(t) + 1)$ with $\sigma$ increasing linearly from 0.15 to 0.25. The figure
plots the evolution of $L_{40}$, $S_{40}$, and $\hat L_{40}$ using a log-Euler approximation of
the type in (3.120). The figure indicates that $\hat L_{40}$ tracks $L_{40}$ quite closely and
that even $S_{40}$ is highly correlated with $L_{40}$.

Fig. 4.3. A sample path of $L_{40}$ using the true dynamics in (4.6), the geometric
Brownian motion approximation $\hat L_{40}$ with time-varying drift, and the driftless
geometric Brownian motion $S_{40}$.


It should be noted that simulating an auxiliary process as suggested here 
may substantially increase the time required per replication — perhaps even 
doubling it. As with any variance reduction technique, the benefit must be 
weighed against the additional computing time required, using the principles 
in Section 1.1.3. □
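The tractable-dynamics idea in Example 4.1.4 can be sketched in a few lines. The following is a minimal illustration, not the text's own experiment: the local-volatility form `sigma0 * (S/S0)**(beta - 1)`, the Euler scheme, and all parameter values are assumptions chosen only for the example. The control is the discounted call payoff on a geometric Brownian motion driven by the same normal increments, whose expectation is the Black-Scholes price.

```python
import math
import numpy as np

def bs_call(s0, k, r, sigma, t):
    """Black-Scholes price of a European call on a non-dividend-paying asset."""
    d1 = (math.log(s0 / k) + (r + 0.5 * sigma**2) * t) / (sigma * math.sqrt(t))
    d2 = d1 - sigma * math.sqrt(t)
    cdf = lambda v: 0.5 * (1.0 + math.erf(v / math.sqrt(2.0)))
    return s0 * cdf(d1) - k * math.exp(-r * t) * cdf(d2)

def simulate(n_paths=20000, n_steps=50, s0=100.0, k=100.0, r=0.05,
             sigma0=0.2, beta=0.8, t=1.0, seed=0):
    """Return discounted payoffs Y (assumed local-vol model), controls X (GBM), E[X]."""
    rng = np.random.default_rng(seed)
    dt = t / n_steps
    s = np.full(n_paths, s0)                   # Euler path of the hypothetical model
    log_gbm = np.full(n_paths, math.log(s0))   # exact path of the tractable control
    for _ in range(n_steps):
        z = rng.standard_normal(n_paths)       # the same normals drive both processes
        vol = sigma0 * (np.maximum(s, 1e-12) / s0) ** (beta - 1.0)
        s = np.maximum(s * (1.0 + r * dt + vol * math.sqrt(dt) * z), 0.0)
        log_gbm += (r - 0.5 * sigma0**2) * dt + sigma0 * math.sqrt(dt) * z
    disc = math.exp(-r * t)
    y = disc * np.maximum(s - k, 0.0)
    x = disc * np.maximum(np.exp(log_gbm) - k, 0.0)
    return y, x, bs_call(s0, k, r, sigma0, t)

y, x, ex = simulate()
b_hat = np.cov(x, y)[0, 1] / np.var(x, ddof=1)   # estimated coefficient
y_cv = y - b_hat * (x - ex)                      # controlled replications
print("plain:      %.4f (variance %.3f)" % (y.mean(), y.var(ddof=1)))
print("controlled: %.4f (variance %.3f)" % (y_cv.mean(), y_cv.var(ddof=1)))
```

Choosing the control's constant volatility equal to `sigma0` follows the advice above to pick a value close to a typical level of the model volatility. Simulating the auxiliary path adds work per replication, so the variance ratio should be weighed against that extra cost.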


Example 4.1.5 Hedges as controls. There is a close link between the se- 
lection of control variates and the selection of hedging instruments. If Y is 
a discounted payoff and we are estimating E[Y], then any instrument that 
serves as an effective hedge for Y also serves as an effective control variate if 
it can be easily priced. Indeed, the calculation of the optimal coefficient b* is 
identical to the calculation of the optimal hedge ratio in minimum-variance 
hedging (see, e.g., Section 2.9 of Hull [189]). 

Whereas a static hedge (a fixed position in another asset) may provide
significant variance reduction, a dynamic hedge can, in principle, remove all
variance, at least under the assumptions of a complete market and continuous
trading as discussed in Section 1.2.1. Using the notation of Section 1.2.1,
let V(t) denote the value at time t of a price process that can be replicated
through a self-financing trading strategy applied to a set of underlying assets
$S_j$, $j = 1,\ldots,d$. As in Section 1.2.1, under appropriate conditions we have


$$V(T) = V(0) + \int_0^T \sum_{j=1}^d \frac{\partial V(t)}{\partial S_j}\,dS_j(t).$$


In other words, V is replicated through a delta-hedging strategy that holds
$\partial V/\partial S_j$ shares of asset $S_j$ at each instant. This suggests that V(T) should be
highly correlated with

$$\sum_{i=1}^m \sum_{j=1}^d \frac{\partial V(t_{i-1})}{\partial S_j}\left[S_j(t_i) - S_j(t_{i-1})\right], \tag{4.9}$$

where $0 = t_0 < t_1 < \cdots < t_m = T$; this is a discrete-time approximation
to the dynamic hedging strategy. Of course, in practice if V(t) is unknown
then its derivatives are likely to be unknown. One may still obtain an effective
control variate by using a rough approximation to the $\partial V/\partial S_j$; for example,
one might calculate these deltas as though the underlying asset prices followed
geometric Brownian motions, even if their actual dynamics are more complex.
See Clewlow and Carverhill [87] for examples of this.

Using an expression like (4.9) as a control variate is somewhat similar to
using all the increments $S_j(t_i) - S_j(t_{i-1})$ as controls or, more conveniently,
the increments of the discounted asset prices since these have mean zero. The
main difference is that the coefficients $\partial V/\partial S_j$ in (4.9) will not in general be
constants but will depend on the $S_j$ themselves. We may therefore interpret
(4.9) as using the $S_j(t_i) - S_j(t_{i-1})$ as nonlinear control variates. We discuss
nonlinear controls more generally in Section 4.1.4. □
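As a rough sketch of Example 4.1.5, the code below builds a control of the form (4.9) for a European call under geometric Brownian motion, using Black-Scholes deltas and increments of the discounted asset price (which have mean zero under the risk-neutral measure). The model, the monthly rebalancing dates, and the parameter values are assumptions made for illustration; `hedge_control` and `bs_delta` are hypothetical helper names.

```python
import math
import numpy as np

def norm_cdf(x):
    """Standard normal cdf applied elementwise (via math.erf)."""
    return 0.5 * (1.0 + np.vectorize(math.erf)(x / math.sqrt(2.0)))

def bs_delta(s, k, r, sigma, tau):
    """Black-Scholes call delta with time tau > 0 remaining to maturity."""
    d1 = (np.log(s / k) + (r + 0.5 * sigma**2) * tau) / (sigma * math.sqrt(tau))
    return norm_cdf(d1)

def hedge_control(n_paths=10000, m=12, s0=100.0, k=100.0, r=0.05,
                  sigma=0.2, t=1.0, seed=1):
    """Return discounted call payoffs and the discrete delta-hedge control."""
    rng = np.random.default_rng(seed)
    dt = t / m
    s = np.full(n_paths, s0)
    control = np.zeros(n_paths)
    for i in range(m):
        delta = bs_delta(s, k, r, sigma, t - i * dt)   # delta at t_{i-1}
        z = rng.standard_normal(n_paths)
        s_next = s * np.exp((r - 0.5 * sigma**2) * dt + sigma * math.sqrt(dt) * z)
        # increment of the discounted price; its conditional mean is zero
        control += delta * (math.exp(-r * (i + 1) * dt) * s_next
                            - math.exp(-r * i * dt) * s)
        s = s_next
    y = math.exp(-r * t) * np.maximum(s - k, 0.0)
    return y, control

y, control = hedge_control()
b_hat = np.cov(control, y)[0, 1] / np.var(control, ddof=1)
y_cv = y - b_hat * control          # the control has known mean zero
print("variance ratio:", y_cv.var(ddof=1) / y.var(ddof=1))
```

Because the hedge is rebalanced only at `m` dates, some variance remains; it should shrink as `m` grows, at the price of more delta evaluations per path.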


Example 4.1.6 Primitive controls. In the examples above, we have stressed 
the use of special features of derivative pricing models in identifying potential 
control variates. Indeed, significant variance reduction is usually achieved only 
by taking advantage of special properties of a model. It is nevertheless worth 
mentioning that many generic (and thus typically not very effective) control 
variates are almost always available in a simulation. For example, most of
the models discussed in Chapter 3 are simulated from a sequence $Z_1, Z_2, \ldots$
of independent standard normal random variables. We know that $E[Z_i] = 0$
and $\mathrm{Var}[Z_i] = 1$, so the sample mean and sample variance of the $Z_i$ are
available control variates. At a still more primitive level, most simulations
start from a sequence $U_1, U_2, \ldots$ of independent Unif[0,1] random variables.
Sample moments of the $U_i$ can also be used as controls. □


Later in this chapter we discuss other techniques for reducing variance. In 
a sense, all of these can be viewed as strategies for selecting control variates. 
For suppose we want to estimate E[Y] and in addition to the usual sample
mean $\bar Y$ we have available an alternative unbiased estimator $\tilde Y$. The difference
$(\bar Y - \tilde Y)$ has (known) expectation zero and can thus be used to form a control
variate estimate of the form

$$\bar Y - b(\bar Y - \tilde Y).$$

The special cases b = 0 and b = 1 correspond to using just one of the two
estimators; by optimizing b, we obtain a combined estimator whose variance
is no larger than that of either of the two.


Output Analysis 


In analyzing variance reduction techniques, along with the effectiveness of a 
technique it is important to consider how the technique affects the statisti- 
cal interpretation of simulation outputs. So long as we deal with unbiased, 
independent replications, computing confidence intervals for expectations is 
a simple matter, as noted in Section 1.1 and explained in Appendix A. But 
we will see that some variance reduction techniques complicate interval
estimation by introducing dependence across replications. This issue arises with
control variates if we use the estimated coefficient $\hat b_n$ in (4.5). It turns out
that in the case of control variates the dependence can be ignored in large
samples; a more careful consideration of small-sample issues will be given in
Section 4.1.3.

For any fixed b, the control variate estimator $\bar Y(b)$ in (4.1) is the sample
mean of independent replications $Y_i(b)$, $i = 1,\ldots,n$. Accordingly, an
asymptotically valid $1 - \delta$ confidence interval for E[Y] is provided by

$$\bar Y(b) \pm z_{\delta/2}\frac{\sigma(b)}{\sqrt{n}}, \tag{4.10}$$

where $z_{\delta/2}$ is the $1 - \delta/2$ quantile of the standard normal distribution
($\Phi(z_{\delta/2}) = 1 - \delta/2$) and $\sigma(b)$ is the standard deviation per replication, as in (4.2).
In practice, $\sigma(b)$ is typically unknown but can be estimated using the
sample standard deviation

$$s(b) = \sqrt{\frac{1}{n-1}\sum_{i=1}^n \left(Y_i(b) - \bar Y(b)\right)^2}.$$

The confidence interval (4.10) remains asymptotically valid if we replace $\sigma(b)$
with $s(b)$, as a consequence of the limit in distribution

$$\frac{\bar Y(b) - E[Y]}{\sigma(b)/\sqrt{n}} \Rightarrow N(0,1)$$

and the fact that $s(b)/\sigma(b) \to 1$; see Appendix A.2. If we use the estimated
coefficient $\hat b_n$ in (4.5), then the estimator

$$\bar Y(\hat b_n) = \frac{1}{n}\sum_{i=1}^n \left(Y_i - \hat b_n (X_i - E[X])\right)$$


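is not quite of the form $\bar Y(b)$ because we have replaced the constant b with
the random quantity $\hat b_n$. Nevertheless, because $\hat b_n \to b^*$, we have

$$\sqrt{n}\left(\bar Y(\hat b_n) - \bar Y(b^*)\right) = -(\hat b_n - b^*)\cdot\sqrt{n}\,(\bar X - E[X]) \Rightarrow 0\cdot N(0,\sigma_X^2) = 0,$$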


so $\bar Y(\hat b_n)$ satisfies the same central limit theorem as $\bar Y(b^*)$. This means that
$\bar Y(\hat b_n)$ is asymptotically as precise as $\bar Y(b^*)$. Moreover, the central limit
theorem applies in the form

$$\frac{\bar Y(\hat b_n) - E[Y]}{s(\hat b_n)/\sqrt{n}} \Rightarrow N(0,1),$$

with $s(\hat b_n)$ the sample standard deviation of the $Y_i(\hat b_n)$, $i = 1,\ldots,n$, because
$s(\hat b_n)/\sigma(b^*) \to 1$. In particular, the confidence interval (4.10) remains
asymptotically valid if we replace $\bar Y(b)$ and $\sigma(b)$ with $\bar Y(\hat b_n)$ and $s(\hat b_n)$, and confidence
intervals estimated using $\hat b_n$ are asymptotically no wider than confidence
intervals estimated using the optimal coefficient b*.

We may summarize this discussion as follows. It is a simple matter to
estimate asymptotically valid confidence intervals for control variate estimators.
Moreover, for large n, we get all the benefit of the optimal coefficient
b* by using the estimate $\hat b_n$. However, for finite n, there may still be costs to
using an estimated rather than a fixed coefficient; we return to this point in
Section 4.1.3.
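The output analysis just summarized amounts to computing the adjusted replications with the estimated coefficient and then treating them as if they were i.i.d. A minimal sketch follows, assuming a single scalar control with known mean; the function name `control_variate_ci` and the toy example (Y = exp(Z) with control X = Z) are illustrative choices, not taken from the text.

```python
import math
import numpy as np

Z_975 = 1.959963984540054   # 0.975 quantile of N(0,1), for a 95% interval

def control_variate_ci(y, x, ex, z_quantile=Z_975):
    """Point estimate, confidence-interval half-width and estimated coefficient."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    n = y.size
    # estimated coefficient, the sample analogue of Cov[X, Y] / Var[X]
    b_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    adjusted = y - b_hat * (x - ex)          # the replications Y_i(b_hat)
    half_width = z_quantile * adjusted.std(ddof=1) / math.sqrt(n)
    return adjusted.mean(), half_width, b_hat

rng = np.random.default_rng(2)
x = rng.standard_normal(100000)              # control with E[X] = 0
y = np.exp(x)                                # E[Y] = exp(1/2)
est, half, b_hat = control_variate_ci(y, x, 0.0)
plain_half = Z_975 * y.std(ddof=1) / math.sqrt(y.size)
print("controlled: %.4f +/- %.4f" % (est, half))
print("plain:      %.4f +/- %.4f" % (y.mean(), plain_half))
```

The interval ignores the dependence introduced by estimating the coefficient from the same replications, which is justified only asymptotically, as discussed above.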


4.1.2 Multiple Controls 


We now generalize the method of control variates to the case of multiple
controls. Examples 4.1.1–4.1.6 provide ample motivation for considering this
extension.

o If $e^{-rt}S(t)$ is a martingale, then $e^{-rt_1}S(t_1),\ldots,e^{-rt_d}S(t_d)$ all have
expectation S(0) and thus provide d controls on each path.
o If the simulation involves d underlying assets, the terminal values of all
assets may provide control variates.
o Rather than use a single option as a control variate, we may want to use
options with multiple strikes and maturities.
o In an interest rate simulation, we may choose to use d bonds of different
maturities as controls.

Suppose, then, that each replication i of a simulation produces outputs
$Y_i$ and $X_i = (X_i^{(1)},\ldots,X_i^{(d)})^\top$, and suppose the vector of expectations E[X]
is known. We assume that the pairs $(X_i, Y_i)$, $i = 1,\ldots,n$, are i.i.d. with
covariance matrix
$$\begin{pmatrix} \Sigma_X & \Sigma_{XY} \\ \Sigma_{XY}^\top & \sigma_Y^2 \end{pmatrix}. \tag{4.11}$$

Here, $\Sigma_X$ is d × d, $\Sigma_{XY}$ is d × 1, and, as before, the scalar $\sigma_Y^2$ is the variance
of the $Y_i$. We assume that $\Sigma_X$ is nonsingular; otherwise, some $X^{(k)}$ is a linear
combination of the other $X^{(j)}$s and may be removed from the set of controls.

Let $\bar X$ denote the vector of sample means of the controls. For fixed $b \in \mathbb{R}^d$,
the control variate estimator $\bar Y(b)$ is

$$\bar Y(b) = \bar Y - b^\top(\bar X - E[X]).$$

Its variance per replication is

$$\mathrm{Var}[Y_i - b^\top(X_i - E[X])] = \sigma_Y^2 - 2b^\top\Sigma_{XY} + b^\top\Sigma_X b. \tag{4.12}$$

This is minimized at

$$b^* = \Sigma_X^{-1}\Sigma_{XY}. \tag{4.13}$$

As in the case of a single control variate, this is also the slope (more precisely,
the vector of coefficients) in a regression of Y against X.
As is customary in regression analysis, define

$$R^2 = \frac{\Sigma_{XY}^\top\Sigma_X^{-1}\Sigma_{XY}}{\sigma_Y^2}; \tag{4.14}$$

this generalizes the squared correlation coefficient between scalar X and Y and
measures the strength of the linear relation between the two. Substituting b*
into the expression for the variance per replication and simplifying, we find
that the minimal variance (that is, the variance of $Y_i(b^*)$) is

$$\sigma_Y^2 - \Sigma_{XY}^\top\Sigma_X^{-1}\Sigma_{XY} = (1 - R^2)\sigma_Y^2. \tag{4.15}$$

Thus, $R^2$ measures the fraction of the variance of Y that is removed in
optimally using X as a control.
In practice, the optimal vector of coefficients b* is unknown but may be
estimated. The standard estimator replaces $\Sigma_X$ and $\Sigma_{XY}$ in (4.13) with their
sample counterparts to get

$$\hat b_n = S_X^{-1}S_{XY}, \tag{4.16}$$

where $S_X$ is the d × d matrix with jk entry

$$\frac{1}{n-1}\left(\sum_{i=1}^n X_i^{(j)}X_i^{(k)} - n\bar X^{(j)}\bar X^{(k)}\right) \tag{4.17}$$

and $S_{XY}$ is the d-vector with jth entry

$$\frac{1}{n-1}\left(\sum_{i=1}^n X_i^{(j)}Y_i - n\bar X^{(j)}\bar Y\right).$$
The number of controls d is ordinarily not very large so size is not an obstacle
in inverting $S_X$, but if linear combinations of some of the controls are highly
correlated this matrix may be nearly singular. This should be considered in
choosing multiple controls.

A simple estimate of the standard error of $\bar Y(\hat b_n)$ is provided by $s_n/\sqrt{n}$, where
$s_n$ is the sample standard deviation of the adjusted replications

$$Y_i(\hat b_n) = Y_i - \hat b_n^\top(X_i - E[X]).$$

The estimator $s_n$ ignores the fact that $\hat b_n$ is itself estimated from the
replications, but it is nevertheless a consistent estimator of $\sigma_Y(b^*)$, the optimal
standard deviation implied by (4.15). An asymptotically valid $1 - \delta$ confidence interval
is thus provided by

$$\bar Y(\hat b_n) \pm z_{\delta/2}\frac{s_n}{\sqrt{n}}. \tag{4.18}$$

The connection between control variates and regression analysis suggests an
alternative way of forming confidence intervals; under additional assumptions
about the joint distribution of the $(X_i, Y_i)$ the alternative is preferable,
especially if n is not very large. We return to this point in Section 4.1.3.
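The estimator (4.16) and the standard error used in (4.18) translate directly into a few lines of linear algebra. The sketch below is illustrative only: the function name `multi_control_estimate`, the arithmetic Asian call used as the target, and the choice of discounted asset prices at four dates as controls (each with mean S(0) under the assumed risk-neutral geometric Brownian motion) are all assumptions made for the example.

```python
import numpy as np

def multi_control_estimate(y, x, ex):
    """Control variate estimate with d controls.

    y: (n,) outputs, x: (n, d) controls, ex: (d,) known means E[X].
    Returns the estimate, its standard error s_n / sqrt(n), b_hat and sample R^2.
    """
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    n = y.size
    xc = x - x.mean(axis=0)
    yc = y - y.mean()
    s_x = xc.T @ xc / (n - 1)             # sample covariance matrix S_X, as in (4.17)
    s_xy = xc.T @ yc / (n - 1)            # sample covariance vector S_XY
    b_hat = np.linalg.solve(s_x, s_xy)    # (4.16), without forming the inverse
    adjusted = y - (x - ex) @ b_hat       # adjusted replications Y_i(b_hat)
    r2 = float(s_xy @ b_hat) / y.var(ddof=1)
    return adjusted.mean(), adjusted.std(ddof=1) / np.sqrt(n), b_hat, r2

rng = np.random.default_rng(3)
n, d, s0, k, r, sigma, t = 20000, 4, 100.0, 100.0, 0.05, 0.2, 1.0
dt = t / d
times = dt * np.arange(1, d + 1)
z = rng.standard_normal((n, d))
log_incr = (r - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * z
paths = s0 * np.exp(np.cumsum(log_incr, axis=1))           # S(t_1), ..., S(t_d)
y = np.exp(-r * t) * np.maximum(paths.mean(axis=1) - k, 0.0)
x = np.exp(-r * times) * paths                             # discounted prices
est, se, b_hat, r2 = multi_control_estimate(y, x, np.full(d, s0))
print("estimate %.4f, std. error %.4f, R^2 %.3f" % (est, se, r2))
```

Solving the linear system rather than inverting `s_x` is a numerical-stability choice; if the controls are nearly collinear the system may still be ill-conditioned, as noted above.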


Variance Decomposition 


In looking for effective control variates, it is useful to understand what part of 
the variance of an ordinary estimator is removed through the use of controls. 
We now address this point. 

Let (X,Y) be any random vector with Y scalar and X d-dimensional. Let
(X,Y) have the partitioned covariance matrix in (4.11). For any b, we can
write

$$Y = E[Y] + b^\top(X - E[X]) + \epsilon,$$

simply by defining $\epsilon$ so that equality holds. If $b = b^* = \Sigma_X^{-1}\Sigma_{XY}$, then in fact
$\epsilon$ is uncorrelated with X; i.e., $Y - b^{*\top}(X - E[X])$ is uncorrelated with X so

$$\mathrm{Var}[Y] = \mathrm{Var}[b^{*\top}X] + \mathrm{Var}[\epsilon] = \mathrm{Var}[b^{*\top}X] + \mathrm{Var}[Y - b^{*\top}X].$$

In this decomposition, the part of Var[Y] eliminated by using X as a control
is $\mathrm{Var}[b^{*\top}X]$ and the remaining variance is $\mathrm{Var}[\epsilon]$.

The optimal vector b* makes $b^{*\top}(X - E[X])$ the projection of Y − E[Y]
onto X − E[X]; the residual $\epsilon$ may be interpreted as the part of Y − E[Y]
orthogonal to X − E[X], orthogonal here meaning uncorrelated. The smaller
this orthogonal component (as measured by its variance), the greater the
variance reduction achieved by using X as a control for Y. If, in particular,
Y is a linear combination of the components of X, then using X as a control
eliminates all variance. Of course, in this case E[Y] is a linear combination of
the (known) components of E[X], so simulation would be unnecessary.
Consider, again, the examples with which we opened this section. If we
use multiple path values $S(t_i)$ of an underlying asset as control variates, we
eliminate all variance in estimating the expected value of any instrument
whose payoff is a linear combination of the $S(t_i)$. (In particular, each $E[S(t_i)]$
is trivially estimated without error if we use $S(t_i)$ as a control.) The variance
that remains in estimating an expected payoff while using the $S(t_i)$ as controls
is attributable to the part of the payoff that is uncorrelated with the $S(t_i)$.
Similarly, if we use bond prices as control variates in pricing an interest rate
derivative, the remaining variance is due to the part of the derivative's payoff
that is uncorrelated with any linear combination of bond prices.


Control Variates and Weighted Monte Carlo 


In introducing the idea of a control variate in Section 4.1.1, we explained that 
the observed error in estimating a known quantity could be used to adjust 
an estimator of an unknown quantity. But the technique has an alternative 
interpretation as a method for assigning weights to replications. This alternative
perspective is sometimes useful, particularly in relating control variates
to other methods. 

For simplicity, we start with the case of a single control; thus, $Y_i$ and $X_i$
are scalars and the pairs $(X_i, Y_i)$ are i.i.d. The control variate estimator with
estimated optimal coefficient $\hat b_n$ is $\bar Y(\hat b_n) = \bar Y - \hat b_n(\bar X - E[X])$. As in (4.5), the
estimated coefficient is given by

$$\hat b_n = \frac{\sum_{i=1}^n (X_i - \bar X)(Y_i - \bar Y)}{\sum_{i=1}^n (X_i - \bar X)^2}.$$

By substituting this expression into $\bar Y(\hat b_n)$ and simplifying, we arrive at

$$\bar Y(\hat b_n) = \sum_{i=1}^n \left(\frac{1}{n} + \frac{(\bar X - X_i)(\bar X - E[X])}{\sum_{j=1}^n (X_j - \bar X)^2}\right) Y_i \equiv \sum_{i=1}^n w_i Y_i. \tag{4.19}$$

In other words, the control variate estimator is a weighted average of the
replications $Y_1,\ldots,Y_n$. The weights $w_i$ are completely determined by the
observations $X_1,\ldots,X_n$ of the control.

A similar representation applies with multiple controls. Using the
estimated vector of coefficients in (4.16) and the sample covariance matrix $S_X$ in
(4.17) and simplifying, we get

$$\bar Y(\hat b_n) = \sum_{i=1}^n \left(\frac{1}{n} + \frac{1}{n-1}(\bar X - X_i)^\top S_X^{-1}(\bar X - E[X])\right) Y_i, \tag{4.20}$$

which is again a weighted average of the $Y_i$. Here, as before, $X_i$ denotes the
vector of controls $(X_i^{(1)},\ldots,X_i^{(d)})^\top$ from the ith replication, and $\bar X$ and E[X]
denote the sample mean and expectation of the $X_i$, respectively.
The representation in (4.20) is a special case of a general feature of
regression, namely, that a fitted value of Y is a weighted average of the observed
values of the $Y_i$ with weights determined by the $X_i$. One consequence of this
representation is that if we want to estimate multiple quantities (e.g., prices
of various derivative securities) from the same set of simulated paths using
the same control variates, the weights can be calculated just once and then
applied to all the outputs. Hesterberg and Nelson [178] also show that (4.20)
is useful in applying control variates to quantile estimation. They indicate
that although it is possible for some of the weights in (4.19) and (4.20) to
be negative, the probability of negative weights is small in a sense they make
precise.
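The weighted Monte Carlo view can be checked numerically. The sketch below computes the weights in (4.20) once and applies them to two different outputs; the function name `control_variate_weights` and the toy outputs `y1`, `y2` are hypothetical and chosen only so that the example is self-contained.

```python
import numpy as np

def control_variate_weights(x, ex):
    """Weights w_i of (4.20) for controls x of shape (n, d) with known mean ex."""
    x = np.asarray(x, dtype=float)
    n = x.shape[0]
    xbar = x.mean(axis=0)
    xc = x - xbar
    s_x = xc.T @ xc / (n - 1)                       # sample covariance S_X
    return 1.0 / n + (xbar - x) @ np.linalg.solve(s_x, xbar - ex) / (n - 1)

rng = np.random.default_rng(4)
x = rng.standard_normal((5000, 2))                  # two controls, E[X] = (0, 0)
ex = np.zeros(2)
y1 = np.exp(x[:, 0]) + x[:, 1] ** 2                 # first output of interest
y2 = np.maximum(x[:, 0] + x[:, 1], 0.0)             # second output, same paths

w = control_variate_weights(x, ex)                  # computed once
est1, est2 = w @ y1, w @ y2                         # reused for every output

# Cross-check against the direct form Ybar - b_hat' (Xbar - E[X]) for y1.
xc = x - x.mean(axis=0)
b1 = np.linalg.solve(xc.T @ xc, xc.T @ (y1 - y1.mean()))
direct1 = y1.mean() - b1 @ (x.mean(axis=0) - ex)
print(est1, direct1, est2)
```

Two properties follow from the algebra and make useful sanity checks: the weights sum to one, and the weighted average of the controls themselves reproduces E[X] exactly.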


4.1.3 Small-Sample Issues 


In our discussion (following (4.10)) of output analysis with the method of
control variates, we noted that because $\hat b_n$ converges to b*, we obtain an
asymptotically valid confidence interval if we ignore the randomness of $\hat b_n$
and the dependence it introduces among $Y_i(\hat b_n)$, $i = 1,\ldots,n$. Moreover, we
noted that as $n \to \infty$, the variance reduction achieved using $\hat b_n$ approaches
the variance reduction that would be achieved if the optimal b* were known.

In this section, we supplement these large-sample properties with a dis- 
cussion of statistical issues that arise in analyzing control variate estimators 
based on a finite number of samples. We note that stronger distributional 
assumptions on the simulation output lead to confidence intervals valid for 
all n. Moreover, it becomes possible to quantify the loss in efficiency due to 
estimating b*. This offers some guidance in deciding how many control
variates to use in a simulation. This discussion is based on results of Lavenberg,
Moeller, and Welch [221] and Nelson [277].

For any fixed b, the control variate estimator $\bar Y(b)$ is unbiased. But using
$\hat b_n$, we have

$$\mathrm{Bias}(\bar Y(\hat b_n)) = E[\bar Y(\hat b_n)] - E[Y] = -E[\hat b_n^\top(\bar X - E[X])],$$

which need not be zero because $\hat b_n$ and $\bar X$ are not independent. A simple way to
eliminate bias is to use $n_1$ replications to compute an estimate $\hat b_{n_1}$ and to then
apply this coefficient with the remaining $n - n_1$ replications of $(X_i, Y_i)$. This
makes the coefficient estimate independent of $\bar X$ and thus makes
$E[\hat b_{n_1}^\top(\bar X - E[X])] = E[\hat b_{n_1}^\top]\,E[\bar X - E[X]] = 0$.
In practice, the bias produced by estimating b* is usually small so
the cost of estimating coefficients through separate replications is unattractive.
Indeed, the bias is typically O(1/n), whereas the standard error is $O(1/\sqrt{n})$.

Lavenberg, Moeller, and Welch [221] and Nelson [277] note that even if
$\hat b_n$ is estimated from the same replications used to compute $\bar Y$ and $\bar X$, the
control variate estimator is unbiased if the regression of Y on X is linear.
More precisely, if

$$E[Y|X] = c_0 + c_1X^{(1)} + \cdots + c_dX^{(d)} \quad \text{for some constants } c_0, c_1, \ldots, c_d, \tag{4.21}$$

then $E[\bar Y(\hat b_n)] = E[Y]$.

Under the additional assumption that Var[Y|X] does not depend on X,
Nelson [277] notes that an unbiased estimator of the variance of $\bar Y(\hat b_n)$,
n > d + 1, is provided by

$$\hat s_n^2 = \left(\frac{1}{n-d-1}\sum_{i=1}^n \left[Y_i - \bar Y(\hat b_n) - \hat b_n^\top(X_i - E[X])\right]^2\right) \times \left(\frac{1}{n} + \frac{1}{n-1}(\bar X - E[X])^\top S_X^{-1}(\bar X - E[X])\right).$$

The first factor in this expression is the sample variance of the regression
residuals, the denominator n − d − 1 reflecting the loss of d + 1 degrees of
freedom in estimating the regression coefficients in (4.21). The second factor
inflates the variance estimate when $\bar X$ is far from E[X].

As in Lavenberg et al. [221] and Nelson [277], we now add a final
assumption that (X,Y) has a multivariate normal distribution, from which two
important consequences follow. The first is that this provides an exact
confidence interval for E[Y] for all n: the interval

$$\bar Y(\hat b_n) \pm t_{n-d-1,\delta/2}\,\hat s_n \tag{4.22}$$

covers E[Y] with probability $1 - \delta$, where $t_{n-d-1,\delta/2}$ denotes the $1 - \delta/2$
quantile of the t distribution with n − d − 1 degrees of freedom. This confidence
interval may have better coverage than the crude interval (4.18), even if the
assumptions on which it is based do not hold exactly.

A second important conclusion that holds under the added assumption of
normality is an exact expression for the variance of $\bar Y(\hat b_n)$. With d controls
and n > d + 2 replications,

$$\mathrm{Var}[\bar Y(\hat b_n)] = \frac{n-2}{n-d-2}\,(1 - R^2)\frac{\sigma_Y^2}{n}. \tag{4.23}$$

Here, as before, $\sigma_Y^2 = \mathrm{Var}[Y]$ is the variance per replication without controls
and $R^2$ is the squared multiple correlation coefficient defined in (4.14). As
noted in (4.15), $(1 - R^2)\sigma_Y^2$ is the variance per replication of the control
variate estimator with known optimal coefficient. We may thus write (4.23)
as

$$\mathrm{Var}[\bar Y(\hat b_n)] = \frac{n-2}{n-d-2}\,\mathrm{Var}[\bar Y(b^*)]. \tag{4.24}$$

In light of this relation, Lavenberg et al. [221] call (n − 2)/(n − d − 2) the
loss factor measuring the loss in efficiency due to using the estimate $\hat b_n$ rather
than the exact value b*.

Both (4.22) and (4.24) penalize the use of too many controls; more
precisely, they penalize the use of control variates that do not provide a
sufficiently large reduction in variance. In (4.22), a larger d results in a loss of
degrees of freedom, a larger multiplier $t_{n-d-1,\delta/2}$, and thus a wider confidence
interval unless the increase in d is offset by a sufficient decrease in $\hat s_n$. In (4.24),
a larger d results in a larger loss factor and thus a greater efficiency cost from
using estimated coefficients. In both cases, the cost from using more controls
is eventually overwhelmed by increasing the sample size n; what constitutes
a reasonable number of control variates thus depends in part on the intended
number of replications.
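The trade-off in (4.23) and (4.24) is easy to tabulate. The helper names below (`loss_factor`, `variance_ratio`) are hypothetical, and the numbers are valid only under the normality assumption behind (4.23).

```python
def loss_factor(n, d):
    """Loss factor (n - 2) / (n - d - 2) from (4.24); requires n > d + 2."""
    if n <= d + 2:
        raise ValueError("the loss factor requires n > d + 2")
    return (n - 2) / (n - d - 2)

def variance_ratio(n, d, r2):
    """Var of the estimated-coefficient estimator relative to the plain sample mean.

    By (4.23) this equals loss_factor(n, d) * (1 - R^2); values above 1 mean
    the controls hurt rather than help at this sample size.
    """
    return loss_factor(n, d) * (1.0 - r2)

for n in (20, 100, 1000):
    print(n, [round(loss_factor(n, d), 4) for d in (1, 5, 10)])
```

For instance, with n = 20 replications and d = 10 controls the loss factor is 18/8 = 2.25, so controls with $R^2 = 0.5$ would give a variance ratio of 1.125 and would be worse than using no controls at all; with n = 1000 the same controls lose almost nothing to estimation.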

The validity of the confidence interval (4.22) and the loss factor in (4.24) 
depends on the distributional assumptions on (X,Y ) introduced above leading 
up to (4.22); in particular, these results depend on the assumed normality of 
(X,Y). (Loh [239] provides extensions to more general distributions but these 
seem difficult to use in practice.) In pricing applications, Y would often be the 
discounted payoff of an option contract and thus highly skewed and distinctly 
non-normal. In this case, application of (4.22) and (4.24) lacks theoretical 
support. 

Nelson [277] analyzes the use of various remedies for control variate estima- 
tors when the distributional assumptions facilitating their statistical analysis 
fail to hold. Among the methods he examines is batching. This method groups 
the replications (X;,Y;), i = 1,...,n, into k disjoint batches of n/k replica- 
tions each. It then calculates sample means of the (X;, Y;) within each batch 
and applies the usual control variate procedure to the k sample means of the 
k batches. The appeal of this method lies in the fact that the batch means 
should be more nearly normally distributed than the original (X;, Y;). The 
cost of batching lies in the loss of degrees of freedom: it reduces the effective 
sample size from n to k. Based on a combination of theoretical and exper- 
imental results, Nelson [277] recommends forming 30 to 60 batches if up to 
five controls are used. With a substantially larger number of controls, the cost 
of replacing the number of replications n with k = 30-60 in (4.22) and (4.24) 
would be more significant; this would argue in favor of using a larger number 
of smaller batches. 
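Batching as described above can be sketched as follows. This is a minimal illustration under assumed inputs: the function names, the choice of k = 40 batches, and the skewed toy output are not from the text or from Nelson [277].

```python
import numpy as np

def batch_means(y, x, k):
    """Group n replications into k batches and return the k batch means.

    y: (n,) outputs, x: (n,) or (n, d) controls. If k does not divide n,
    the leftover replications at the end are dropped.
    """
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    m = y.size // k                                   # replications per batch
    yb = y[: k * m].reshape(k, m).mean(axis=1)
    xb = x[: k * m].reshape(k, m, -1).mean(axis=1)    # shape (k, d)
    return yb, xb

def control_variate_from_batches(yb, xb, ex):
    """Usual control variate procedure applied to the k batch means."""
    xc = xb - xb.mean(axis=0)
    b_hat = np.linalg.lstsq(xc, yb - yb.mean(), rcond=None)[0]
    adjusted = yb - (xb - ex) @ b_hat
    return adjusted.mean(), b_hat

rng = np.random.default_rng(5)
x = rng.standard_normal((6000, 2))                    # controls with E[X] = (0, 0)
y = np.exp(x[:, 0]) + 0.5 * x[:, 1]                   # skewed output, E[Y] = exp(1/2)
yb, xb = batch_means(y, x, k=40)
est, b_hat = control_variate_from_batches(yb, xb, np.zeros(2))
print("batched control variate estimate:", est)
```

After batching, the small-sample formulas (4.22) and (4.24) would be applied with k in place of n, which is where the loss of degrees of freedom shows up.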

Another strategy for potentially improving the performance of control
variate estimators replaces the estimated covariance matrix $S_X$ with its true value
$\Sigma_X$ in estimating b*. This is feasible if $\Sigma_X$ is known, which would be the case
in at least some of the examples introduced in Section 4.1.1. Nelson [277] and
Bauer, Venkatraman, and Wilson [40] analyze this alternative; perhaps
surprisingly, they find that it generally produces estimators inferior to the usual
method based on $\hat b_n$.


4.1.4 Nonlinear Controls 


Our discussion of control variates has thus far focused exclusively on linear
controls, meaning estimators of the form

$$\bar Y - b^\top(\bar X - E[X]), \tag{4.25}$$

with the vector b either known or estimated. There are, however, other ways
one might use the discrepancy between $\bar X$ and E[X] to try to improve the
estimator $\bar Y$ in estimating E[Y]. For example, in the case of scalar X, the
estimator

$$\bar Y\,\frac{E[X]}{\bar X}$$

adjusts $\bar Y$ upward if $0 < \bar X < E[X]$, downward if $0 < E[X] < \bar X$, and thus may
be attractive if $X_i$ and $Y_i$ are positively correlated. Similarly, the estimator
$\bar Y\bar X/E[X]$ may have merit if the $X_i$ and $Y_i$ are negatively correlated. Other
estimators of this type include

$$\bar Y\exp(\bar X - E[X]) \quad \text{and} \quad \bar Y^{(\bar X/E[X])}.$$

In each case, the convergence of $\bar X$ to E[X] ensures that the adjustment to
$\bar Y$ vanishes as the sample size increases, just as in (4.25). But for any finite
number of replications, the variance of the adjusted estimator could be larger
or smaller than that of $\bar Y$.

These are examples of nonlinear control variate estimators. They are all
special cases of estimators of the form $h(\bar X, \bar Y)$ for functions h satisfying

$$h(E[X], y) = y \quad \text{for all } y.$$

The difference between the controlled estimator $h(\bar X, \bar Y)$ and $\bar Y$ thus depends
on the deviation of $\bar X$ from E[X].

Although the introduction of nonlinear controls would appear to substan- 
tially enlarge the class of candidate estimators, it turns out that in large 
samples, a nonlinear control variate estimator based on a smooth h is equiv- 
alent to an ordinary linear control variate estimator. This was demonstrated 
in Glynn and Whitt [159], who note a related observation in Cheng and Feast 
[84]. We present the analysis leading to this conclusion and then discuss its
implications.


Delta Method 


The main tool for the large-sample analysis of nonlinear control variate
estimators is the delta method. This is a result providing a central limit theorem
for functions of sample means. To state it generally, we let $\xi_i$, $i = 1, 2, \ldots$ be
i.i.d. random vectors in $\mathbb{R}^d$ with mean vector $\mu$ and covariance matrix $\Sigma$. The
sample mean $\bar\xi$ of $\xi_1,\ldots,\xi_n$ satisfies the central limit theorem

$$\sqrt{n}\,[\bar\xi - \mu] \Rightarrow N(0, \Sigma).$$

Now let $h : \mathbb{R}^d \to \mathbb{R}$ be continuously differentiable in a neighborhood of $\mu$ and
suppose the partial derivatives of h at $\mu$ are not all zero. For sufficiently large
n, a Taylor approximation gives

$$h(\bar\xi) = h(\mu) + \nabla h(\zeta_n)[\bar\xi - \mu],$$
with $\nabla h$ the gradient of h (a row vector) and $\zeta_n$ a point on the line segment
joining $\mu$ and $\bar\xi$. As $n \to \infty$, $\bar\xi \to \mu$ and thus $\zeta_n \to \mu$ as well; continuity of the
gradient implies $\nabla h(\zeta_n) \to \nabla h(\mu)$. Thus, for large n, the error $h(\bar\xi) - h(\mu)$
is approximately the inner product of the constant vector $\nabla h(\mu)$ and the
asymptotically normal vector $\bar\xi - \mu$, and is itself asymptotically normal. More
precisely,

$$\sqrt{n}\,[h(\bar\xi) - h(\mu)] \Rightarrow N(0, \nabla h(\mu)\,\Sigma\,\nabla h(\mu)^\top). \tag{4.26}$$

See also Section 3.3 of Serfling [326], for example.


For the application to nonlinear controls, we replace $\xi_i$ with $(X_i, Y_i)$, $\mu$
with (E[X], E[Y]), and $\Sigma$ with

$$\Sigma = \begin{pmatrix} \Sigma_X & \Sigma_{XY} \\ \Sigma_{XY}^\top & \sigma_Y^2 \end{pmatrix},$$

the covariance matrix in (4.11). From the delta method, we know that the
nonlinear control variate estimator is asymptotically normal with

$$\sqrt{n}\,[h(\bar X, \bar Y) - E[Y]] \Rightarrow N(0, \sigma_h^2)$$

(recall that h(E[X], E[Y]) = E[Y]) and

$$\sigma_h^2 = \left(\frac{\partial h}{\partial y}\right)^2\sigma_Y^2 + 2\left(\frac{\partial h}{\partial y}\right)\nabla_x h\,\Sigma_{XY} + \nabla_x h\,\Sigma_X\,\nabla_x h^\top,$$

with $\nabla_x h$ denoting the gradient of h with respect to the elements of X and
with all derivatives evaluated at (E[X], E[Y]). Because h(E[X], ·) is the
identity, the partial derivative of h with respect to its last argument equals 1 at
(E[X], E[Y]), so

$$\sigma_h^2 = \sigma_Y^2 + 2\nabla_x h\,\Sigma_{XY} + \nabla_x h\,\Sigma_X\,\nabla_x h^\top.$$

But this is precisely the variance of

$$Y_i - b^\top(X_i - E[X])$$

with $b = -\nabla_x h(E[X], E[Y])^\top$; see (4.12). Thus, the distribution of the nonlinear
control variate estimator using X is asymptotically the same as the distribution
of an ordinary linear control variate estimator using X and a specific
vector of coefficients b. In particular, the limiting variance parameter $\sigma_h^2$ can
be no smaller than the optimal variance that would be derived from using the
optimal vector b*.

A negative reading of this result leads to the conclusion that nonlinear
controls add nothing beyond what can be achieved using linear controls. A
somewhat more positive and more accurate interpretation would be that whatever
advantages a nonlinear control variate estimator may have must be limited to
small samples. "Small" may well include all relevant sample sizes in specific
applications. The delta method tells us that asymptotically only the linear
part of h matters, but if h is highly nonlinear a very large sample may be
required for this asymptotic conclusion to be relevant. For fixed n, each of the
examples with which we opened this section may perform rather differently
from a linear control.

It should also be noted that in the linear control variate estimator to which
any nonlinear control variate estimator is ultimately equivalent, the coefficient
b is implicitly determined by the function h. In particular, using a nonlinear
control does not entail estimating this coefficient. In some cases, a nonlinear
control may be effective because $-\nabla_x h$ is close to optimal but need not be
estimated.
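As a worked instance of this equivalence, consider the ratio estimator from the start of this section with scalar X and E[X] ≠ 0. The derivation below is a sketch that simply applies the delta-method formula above to this particular h; the symbol $\sigma_{XY}$ for Cov[X, Y] and $\sigma_X^2$ for Var[X] are notational assumptions for the scalar case.

```latex
% Ratio estimator: h(x, y) = y E[X] / x, which satisfies h(E[X], y) = y.
\begin{align*}
\frac{\partial h}{\partial y}\Big|_{(E[X],E[Y])} &= \frac{E[X]}{x}\Big|_{x=E[X]} = 1,
&
\frac{\partial h}{\partial x}\Big|_{(E[X],E[Y])} &= -\frac{y\,E[X]}{x^2}\Big|_{(E[X],E[Y])} = -\frac{E[Y]}{E[X]}.
\end{align*}
% Implied linear coefficient and limiting variance:
\begin{align*}
b &= -\frac{\partial h}{\partial x} = \frac{E[Y]}{E[X]},
&
\sigma_h^2 &= \sigma_Y^2 - 2b\,\sigma_{XY} + b^2\sigma_X^2 .
\end{align*}
% Comparison with the optimal coefficient b^* = \sigma_{XY}/\sigma_X^2:
\begin{equation*}
\sigma_h^2 - \Bigl(\sigma_Y^2 - \frac{\sigma_{XY}^2}{\sigma_X^2}\Bigr) = \sigma_X^2\,(b - b^*)^2 \;\ge\; 0 .
\end{equation*}
```

So in large samples the ratio estimator behaves like a linear control with coefficient E[Y]/E[X], and it is asymptotically optimal only when this ratio happens to equal $\sigma_{XY}/\sigma_X^2$.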


4.2 Antithetic Variates 


The method of antithetic variates attempts to reduce variance by
introducing negative dependence between pairs of replications. The method can take
various forms; the most broadly applicable is based on the observation that
if U is uniformly distributed over [0,1], then 1 − U is too. Hence, if we
generate a path using as inputs $U_1,\ldots,U_n$, we can generate a second path using
$1-U_1,\ldots,1-U_n$ without changing the law of the simulated process. The
variables $U_i$ and $1 - U_i$ form an antithetic pair in the sense that a large value
of one is accompanied by a small value of the other. This suggests that an
unusually large or small output computed from the first path may be balanced
by the value computed from the antithetic path, resulting in a reduction in
variance.

These observations extend to other distributions through the inverse
transform method: $F^{-1}(U)$ and $F^{-1}(1-U)$ both have distribution F but are
antithetic to each other because $F^{-1}$ is monotone. For a distribution symmetric
about the origin, $F^{-1}(1-u)$ and $F^{-1}(u)$ have the same magnitudes but
opposite signs. In particular, in a simulation driven by independent standard
normal random variables, antithetic variates can be implemented by pairing a
sequence $Z_1, Z_2, \ldots$ of i.i.d. N(0,1) variables with the sequence $-Z_1, -Z_2, \ldots$
of i.i.d. N(0,1) variables, whether or not they are sampled through the
inverse transform method. If the $Z_i$ are used to simulate the increments of a
Brownian path, then the $-Z_i$ simulate the increments of the reflection of the
path about the origin. This again suggests that running a pair of simulations
using the original path and then its reflection may result in lower variance.

To analyze this approach more precisely, suppose our objective is to
estimate an expectation E[Y] and that using some implementation of antithetic
sampling produces a sequence of pairs of observations $(Y_1, \tilde Y_1), (Y_2, \tilde Y_2), \ldots,
(Y_n, \tilde Y_n)$. The key features of the antithetic variates method are the following:

o the pairs $(Y_1, \tilde Y_1), (Y_2, \tilde Y_2), \ldots, (Y_n, \tilde Y_n)$ are i.i.d.;
o for each i, $Y_i$ and $\tilde Y_i$ have the same distribution, though ordinarily they are
not independent.

We use Y generically to indicate a random variable with the common
distribution of the $Y_i$ and $\tilde Y_i$.
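The antithetic variates estimator is simply the average of all 2n
observations,

$$\hat Y_{AV} = \frac{1}{2n}\left(\sum_{i=1}^n Y_i + \sum_{i=1}^n \tilde Y_i\right) = \frac{1}{n}\sum_{i=1}^n\left(\frac{Y_i + \tilde Y_i}{2}\right). \tag{4.27}$$

The rightmost representation in (4.27) makes it evident that $\hat Y_{AV}$ is the sample
mean of the n independent observations

$$\left(\frac{Y_1 + \tilde Y_1}{2}\right), \left(\frac{Y_2 + \tilde Y_2}{2}\right), \ldots, \left(\frac{Y_n + \tilde Y_n}{2}\right). \tag{4.28}$$

The central limit theorem therefore applies and gives

$$\frac{\hat Y_{AV} - E[Y]}{\sigma_{AV}/\sqrt{n}} \Rightarrow N(0,1)$$

with

$$\sigma_{AV}^2 = \mathrm{Var}\left[\frac{Y_i + \tilde Y_i}{2}\right].$$

As usual, this limit in distribution continues to hold if we replace $\sigma_{AV}$ with
$s_{AV}$, the sample standard deviation of the n values in (4.28). This provides
asymptotic justification for a $1 - \delta$ confidence interval of the form

$$\hat Y_{AV} \pm z_{\delta/2}\frac{s_{AV}}{\sqrt{n}},$$

where $1 - \Phi(z_{\delta/2}) = \delta/2$.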

Under what conditions is an antithetic variates estimator to be preferred
to an ordinary Monte Carlo estimator based on independent replications? To
make this comparison, we assume that the computational effort required to
generate a pair $(Y_i, \tilde Y_i)$ is approximately twice the effort required to generate
$Y_i$. In other words, we ignore any potential computational savings from, for
example, flipping the signs of previously generated $Z_1, Z_2, \ldots$ rather than
generating new normal variables. This is appropriate if the computational cost
of generating these inputs is a small fraction of the total cost of simulating $Y_i$.
Under this assumption, the effort required to compute $\hat Y_{AV}$ is approximately
that required to compute the sample mean of 2n independent replications, and
it is therefore meaningful to compare the variances of these two estimators.
Using antithetics reduces variance if

$$\mathrm{Var}[\hat Y_{AV}] < \mathrm{Var}\left[\frac{1}{2n}\sum_{i=1}^{2n} Y_i\right];$$

i.e., if

$$\mathrm{Var}[Y_i + \tilde Y_i] < 2\,\mathrm{Var}[Y_i].$$

The variance on the left can be written as

$$\mathrm{Var}[Y_i + \tilde Y_i] = \mathrm{Var}[Y_i] + \mathrm{Var}[\tilde Y_i] + 2\,\mathrm{Cov}[Y_i, \tilde Y_i] = 2\,\mathrm{Var}[Y_i] + 2\,\mathrm{Cov}[Y_i, \tilde Y_i],$$

using the fact that $Y_i$ and $\tilde Y_i$ have the same variance if they have the same
distribution. Thus, the condition for antithetic sampling to reduce variance
becomes

$$\mathrm{Cov}[Y_i, \tilde Y_i] < 0. \tag{4.29}$$


Put succinctly, this condition requires that negative dependence in the
inputs (whether $U$ and $1 - U$ or $Z$ and $-Z$) produce negative correlation
between the outputs of paired replications. A simple sufficient condition
ensuring this is monotonicity of the mapping from inputs to outputs defined by
a simulation algorithm. To state this precisely and to give a general
formulation, suppose the inputs to a simulation are independent random variables
$X_1, \ldots, X_m$. Suppose that $Y$ is an increasing function of these inputs and $\tilde{Y}$
is a decreasing function of the inputs; then

\[
E[Y\tilde{Y}] \le E[Y]\,E[\tilde{Y}].
\]

This is a special case of more general properties of associated random
variables, in the sense of Esary, Proschan, and Walkup [113]. Observe that if
$Y = f(U_1, \ldots, U_d)$ or $Y = f(Z_1, \ldots, Z_d)$ for some increasing function $f$, then
$\tilde{Y} = f(1 - U_1, \ldots, 1 - U_d)$ and $\tilde{Y} = f(-Z_1, \ldots, -Z_d)$ are decreasing functions
of $(U_1, \ldots, U_d)$ and $(Z_1, \ldots, Z_d)$, respectively. The requirement that the
simulation map inputs to outputs monotonically is rarely satisfied exactly, but
provides some qualitative insight into the scope of the method.

The antithetic pairs $(U, 1 - U)$ with $U \sim \mathrm{Unif}[0,1]$ and $(Z, -Z)$ with
$Z \sim N(0,1)$ share an additional relevant property: in each case, the average of the
paired values is the population mean, because

\[
\frac{U + (1 - U)}{2} = 1/2 \quad\text{and}\quad \frac{Z + (-Z)}{2} = 0.
\]

It follows that if the output $Y$ is a linear function of inputs $(U_1, \ldots, U_d)$ or
$(Z_1, \ldots, Z_d)$, then antithetic sampling results in a zero-variance estimator. Of
course, in the linear case simulation would be unnecessary, but this observation
suggests that antithetic variates will be very effective if the mapping from
inputs to outputs is close to linear.


Variance Decomposition 


Antithetic variates eliminate the variance due to the antisymmetric part of an
integrand, in a sense we now develop. For simplicity, we restrict attention to
the case of standard normal inputs, but our observations apply equally well
to any other distribution symmetric about the origin and apply with minor
modifications to uniformly distributed inputs.

Suppose, then, that $Y = f(Z)$ with $Z = (Z_1, \ldots, Z_d) \sim N(0, I)$. Define
the symmetric and antisymmetric parts of $f$, respectively, by

\[
f_0(z) = \frac{f(z) + f(-z)}{2} \quad\text{and}\quad f_1(z) = \frac{f(z) - f(-z)}{2}.
\]

Clearly, $f = f_0 + f_1$; moreover, this gives an orthogonal decomposition of $f$
in the sense that $f_0(Z)$ and $f_1(Z)$ are uncorrelated:

\[
\begin{aligned}
E[f_0(Z) f_1(Z)] &= \tfrac{1}{4} E\left[f^2(Z) - f^2(-Z)\right] \\
&= 0 \\
&= E[f_0(Z)]\,E[f_1(Z)].
\end{aligned}
\]

It follows that

\[
\mathrm{Var}[f(Z)] = \mathrm{Var}[f_0(Z)] + \mathrm{Var}[f_1(Z)]. \tag{4.30}
\]

The first term on the right is the variance of an estimate of $E[f(Z)]$ based on
an antithetic pair $(Z, -Z)$. Thus, antithetic sampling eliminates all variance
if $f$ is antisymmetric ($f = f_1$) and it eliminates no variance if $f$ is symmetric
($f = f_0$).
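The decomposition of the variance into symmetric and antisymmetric parts can be checked numerically. The sketch below uses a hypothetical integrand $f(z) = z + z^2$, chosen only because its parts are easy to read off ($f_1(z) = z$ and $f_0(z) = z^2$, with variances 1 and 2 under the standard normal); the sample size and seed are likewise arbitrary.

```python
import random
from statistics import pvariance


def f(z):
    return z + z * z  # hypothetical integrand: antisymmetric part z, symmetric part z^2


def f0(z):
    return 0.5 * (f(z) + f(-z))  # symmetric part


def f1(z):
    return 0.5 * (f(z) - f(-z))  # antisymmetric part


rng = random.Random(1)
zs = [rng.gauss(0.0, 1.0) for _ in range(50_000)]
var_f = pvariance([f(z) for z in zs])
var_f0 = pvariance([f0(z) for z in zs])   # what an antithetic pair leaves behind
var_f1 = pvariance([f1(z) for z in zs])   # what an antithetic pair removes
```

Up to sampling error, `var_f` should be close to `var_f0 + var_f1`, so for this integrand roughly one third of the variance is removed by pairing $Z$ with $-Z$.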

Fox [127] advocates the use of antithetic sampling as the first step of a more
elaborate framework, in order to eliminate the variance due to the linear (or, 
more generally, the antisymmetric) part of f. 


Systematic Sampling 


Antithetic sampling pairs a standard normal vector $Z = (Z_1, \ldots, Z_d)$ with its
reflection $-Z = (-Z_1, \ldots, -Z_d)$, but it is natural to consider other vectors
formed by changing the signs of the components of $Z$. Generalizing still further
leads us to consider transformations $T : \mathbb{R}^d \to \mathbb{R}^d$ (such as multiplication
by an orthogonal matrix) with the property that $TZ \sim N(0, I)$ whenever
$Z \sim N(0, I)$. This property implies that the iterated transformations $T^2 Z$,
$T^3 Z, \ldots$ will also have standard normal distributions. Suppose that $T^k$ is
the identity for some $k$. The usual antithetic transformation has $k = 2$, but
by considering other rotations and reflections of $\mathbb{R}^d$, it is easy to construct
examples with larger values of $k$.

Define

\[
f_0(z) = \frac{1}{k}\sum_{i=1}^{k} f(T^i z) \quad\text{and}\quad f_1(z) = f(z) - f_0(z).
\]


We clearly have $E[f_0(Z)] = E[f(Z)]$. The estimator $f_0(Z)$ generalizes the
antithetic variates estimator; in the survey sampling literature, methods of this
type are called systematic sampling because after the initial random drawing
of $Z$, the $k - 1$ subsequent points are obtained through deterministic
transformations of $Z$.
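The representation $f(Z) = f_0(Z) + f_1(Z)$ again gives an orthogonal
decomposition. To see this, first observe that

\[
\begin{aligned}
E[f_0(Z)^2] &= \frac{1}{k}\sum_{i=1}^{k} E\left[f(T^i Z)\cdot\frac{1}{k}\sum_{j=1}^{k} f(T^j Z)\right] \\
&= \frac{1}{k}\sum_{i=1}^{k} E\left[f(T^i Z)\cdot\frac{1}{k}\sum_{j=1}^{k} f(T^j T^i Z)\right] \\
&= \frac{1}{k}\sum_{i=1}^{k} E\left[f(T^i Z)\, f_0(T^i Z)\right] \\
&= E[f(Z) f_0(Z)],
\end{aligned}
\]

so

\[
E[f_0(Z) f_1(Z)] = E[f_0(Z)(f(Z) - f_0(Z))] = 0.
\]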

Thus, (4.30) continues to hold under the new definitions of $f_0$ and $f_1$.
Assuming the $f(T^i Z)$, $i = 1, \ldots, k$, require approximately equal computing times,
the estimator $f_0(Z)$ beats ordinary Monte Carlo if

\[
\mathrm{Var}[f_0(Z)] < \frac{1}{k}\mathrm{Var}[f(Z)].
\]

The steps leading to (4.29) generalize to the requirement

\[
\sum_{j=1}^{k-1} \mathrm{Cov}[f(Z), f(T^j Z)] < 0.
\]

This condition is usually at least as difficult to satisfy as the simple version
(4.29) for ordinary antithetic sampling.
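To make the generalization concrete, here is a small Python sketch in which $T$ is taken to be a rotation of the plane by $2\pi/k$, so that $T^k$ is the identity. The choice $d = 2$, $k = 4$, the integrand, and all function names are assumptions made for illustration.

```python
import math
import random
from statistics import mean, pvariance


def rotate(z, angle):
    """Rotate a point of R^2; rotations map N(0, I) to N(0, I)."""
    c, s = math.cos(angle), math.sin(angle)
    return (c * z[0] - s * z[1], s * z[0] + c * z[1])


def systematic_average(f, z, k):
    """f_0(z) = (1/k) * sum_{i=1}^{k} f(T^i z), with T a rotation by 2*pi/k."""
    return sum(f(rotate(z, 2.0 * math.pi * i / k)) for i in range(1, k + 1)) / k


def f(z):
    return math.exp(0.5 * z[0])  # hypothetical integrand, E[f(Z)] = exp(1/8)


k = 4
rng = random.Random(2)
draws = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(20_000)]
f0_values = [systematic_average(f, z, k) for z in draws]
plain_values = [f(z) for z in draws]
estimate = mean(f0_values)
# The comparison that decides whether systematic sampling pays off.
beats_plain = pvariance(f0_values) < pvariance(plain_values) / k
```

For this particular integrand the four rotated points include both $Z$ and $-Z$, so the scheme inherits the benefit of ordinary antithetic sampling; other integrands may well fail the comparison.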

For a more general formulation of antithetic sampling and for historical 
remarks, see Hammersley and Handscomb [169]. Boyle [52] is an early ap- 
plication in finance. Other work on antithetic variates includes Fishman and 
Huang [122] and Rubinstein, Samorodnitsky, and Shaked [312]. 


4.3 Stratified Sampling 


4.3.1 Method and Examples 


Stratified sampling refers broadly to any sampling mechanism that constrains
the fraction of observations drawn from specific subsets (or strata) of the
sample space. Suppose, more specifically, that our goal is to estimate $E[Y]$
with $Y$ real-valued, and let $A_1, \ldots, A_K$ be disjoint subsets of the real line for
which $P(Y \in \cup_i A_i) = 1$. Then

\[
E[Y] = \sum_{i=1}^{K} P(Y \in A_i)\,E[Y \mid Y \in A_i] = \sum_{i=1}^{K} p_i\,E[Y \mid Y \in A_i] \tag{4.31}
\]

with $p_i = P(Y \in A_i)$. In random sampling, we generate independent
$Y_1, \ldots, Y_n$ having the same distribution as $Y$. The fraction of these samples
falling in $A_i$ will not in general equal $p_i$, though it would approach $p_i$ as the
sample size $n$ increased. In stratified sampling, we decide in advance what
fraction of the samples should be drawn from each stratum $A_i$; each observation
drawn from $A_i$ is constrained to have the distribution of $Y$ conditional
on $Y \in A_i$.

The simplest case is proportional sampling, in which we ensure that the
fraction of observations drawn from stratum $A_i$ matches the theoretical
probability $p_i = P(Y \in A_i)$. If the total sample size is $n$, this entails generating
$n_i = np_i$ samples from $A_i$. (To simplify the discussion, we ignore rounding
and assume $np_i$ is an integer instead of writing $\lfloor np_i \rfloor$.) For each $i = 1, \ldots, K$,
let $Y_{ij}$, $j = 1, \ldots, n_i$, be independent draws from the conditional distribution
of $Y$ given $Y \in A_i$. An unbiased estimator of $E[Y \mid Y \in A_i]$ is provided by the
sample mean $(Y_{i1} + \cdots + Y_{in_i})/n_i$ of observations from the $i$th stratum. It
follows from (4.31) that an unbiased estimator of $E[Y]$ is provided by

\[
\hat{Y} = \sum_{i=1}^{K} p_i \cdot \frac{1}{n_i}\sum_{j=1}^{n_i} Y_{ij} = \frac{1}{n}\sum_{i=1}^{K}\sum_{j=1}^{n_i} Y_{ij}. \tag{4.32}
\]

This estimator should be contrasted with the usual sample mean $\bar{Y} = (Y_1 +
\cdots + Y_n)/n$ of a random sample of size $n$. Compared with $\bar{Y}$, the stratified
estimator $\hat{Y}$ eliminates sampling variability across strata without affecting
sampling variability within strata.

We generalize this formulation in two simple but important ways. First,
we allow the strata to be defined in terms of a second variable $X$. This
stratification variable could take values in an arbitrary set; to be concrete we assume
it is $\mathbb{R}^d$-valued and thus take the strata $A_i$ to be disjoint subsets of $\mathbb{R}^d$ with
$P(X \in \cup_i A_i) = 1$. The representation (4.31) generalizes to

\[
E[Y] = \sum_{i=1}^{K} P(X \in A_i)\,E[Y \mid X \in A_i] = \sum_{i=1}^{K} p_i\,E[Y \mid X \in A_i], \tag{4.33}
\]

where now $p_i = P(X \in A_i)$. In some applications, $Y$ is a function of $X$ (for
example, $X$ may be a discrete path of asset prices and $Y$ the discounted payoff
of a derivative security), but more generally they may be dependent without
either completely determining the other. To use (4.33) for stratified sampling,
we need to generate pairs $(X_{ij}, Y_{ij})$, $j = 1, \ldots, n_i$, having the conditional
distribution of $(X, Y)$ given $X \in A_i$.

As a second extension of the method, we allow the stratum allocations
$n_1, \ldots, n_K$ to be arbitrary (while summing to $n$) rather than proportional to
$p_1, \ldots, p_K$. In this case, the first representation in (4.32) remains valid but
the second does not. If we let $q_i = n_i/n$ be the fraction of observations drawn
from stratum $i$, $i = 1, \ldots, K$, we can write

\[
\hat{Y} = \sum_{i=1}^{K} p_i \cdot \frac{1}{n_i}\sum_{j=1}^{n_i} Y_{ij} = \frac{1}{n}\sum_{i=1}^{K}\frac{p_i}{q_i}\sum_{j=1}^{n_i} Y_{ij}. \tag{4.34}
\]

By minimizing the variance of this estimator over the $q_i$, we can find an
allocation rule that is at least as effective as a proportional allocation. We
return to this point later in this section.
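The two ways of writing the stratified estimator under an arbitrary allocation can be compared directly. In this minimal Python sketch the stratum probabilities, the allocation, and the data are made up for illustration, and both function names are hypothetical.

```python
def stratified_estimate(samples_by_stratum, p):
    """First form: sum_i p_i * (sample mean of stratum i)."""
    return sum(p_i * sum(ys) / len(ys) for p_i, ys in zip(p, samples_by_stratum))


def stratified_estimate_weighted(samples_by_stratum, p):
    """Second form: (1/n) * sum_i (p_i / q_i) * sum_j Y_ij, with q_i = n_i / n."""
    n = sum(len(ys) for ys in samples_by_stratum)
    total = 0.0
    for p_i, ys in zip(p, samples_by_stratum):
        q_i = len(ys) / n
        total += (p_i / q_i) * sum(ys)
    return total / n


# Made-up data: two strata with probabilities 0.2 and 0.8 but a 2:4 allocation.
p = [0.2, 0.8]
samples = [[1.0, 3.0], [10.0, 20.0, 30.0, 40.0]]
a = stratified_estimate(samples, p)            # 0.2 * 2 + 0.8 * 25
b = stratified_estimate_weighted(samples, p)   # same value, written with p_i / q_i
```

The weights $p_i/q_i$ correct for strata that are over- or under-sampled relative to their probabilities; with a proportional allocation they are all equal to 1.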

From this introduction it should be clear that the use of stratified sampling 
involves consideration of two issues: 


o choosing the stratification variable $X$, the strata $A_1, \ldots, A_K$, and the
allocation $n_1, \ldots, n_K$;
o generating samples from the distribution of $(X, Y)$ conditional on $X \in A_i$.


In addressing the first issue we will see that stratified sampling is most effective 
when the variability of Y within each stratum is small. Solutions to the second 
issue are best illustrated through examples. 


Example 4.3.1 Stratifying uniforms. Perhaps the simplest application of
stratified sampling stratifies the uniformly distributed random variables that
drive a simulation. Partition the unit interval $(0,1)$ into the $n$ strata

\[
A_1 = \left(0, \frac{1}{n}\right],\ A_2 = \left(\frac{1}{n}, \frac{2}{n}\right],\ \ldots,\ A_n = \left(\frac{n-1}{n}, 1\right).
\]

Each of these intervals has probability $1/n$ under the uniform distribution, so
in a proportional allocation we should draw one sample from each stratum.
(The sample size $n$ and the number of strata $K$ are equal in this example.) Let
$U_1, \ldots, U_n$ be independent and uniformly distributed between 0 and 1 and let

\[
V_i = \frac{i-1}{n} + \frac{U_i}{n}, \quad i = 1, \ldots, n. \tag{4.35}
\]

Each $V_i$ is uniformly distributed between $(i-1)/n$ and $i/n$, which is to say
that $V_i$ has the conditional distribution of $U$ given $U \in A_i$ for $U \sim \mathrm{Unif}[0,1]$.
Thus, $V_1, \ldots, V_n$ constitute a stratified sample from the uniform distribution.
(In working with the unit interval and the subintervals $A_i$, we are clearly free
to define these to be open or closed on the left or right; in each setting, we
adopt whatever convention is most convenient.)

Suppose $Y = f(U)$ so that $E[Y]$ is simply the integral of $f$ over the unit
interval. Then the stratified estimator

\[
\hat{Y} = \frac{1}{n}\sum_{i=1}^{n} f(V_i)
\]

is similar to the deterministic midpoint integration rule

\[
\frac{1}{n}\sum_{i=1}^{n} f\left(\frac{2i-1}{2n}\right)
\]

based on the value of $f$ at the midpoints of the $A_i$. A feature of the
randomization in the stratified estimator is that it makes $\hat{Y}$ unbiased.

This example easily generalizes to partitions of $(0,1)$ into intervals of
unequal lengths. If $A_i = (a_i, b_i]$, then the conditional distribution of $U$ given
$U \in A_i$ is uniform between $a_i$ and $b_i$; we can sample from this conditional
distribution by setting $V = a_i + U(b_i - a_i)$. □
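The construction in this example takes only a few lines of code. The Python sketch below draws one uniform per stratum, forms the stratified estimate of an integral, and evaluates the midpoint rule on the same partition for comparison. The function names, the integrand $u^2$, and the seed are illustrative assumptions.

```python
import random


def stratified_uniforms(n, rng):
    """V_i = (i - 1 + U_i) / n: one uniform draw in each of the n equal strata."""
    return [(i - 1 + rng.random()) / n for i in range(1, n + 1)]


def stratified_estimate(f, n, rng):
    """Randomized estimate of the integral of f over the unit interval."""
    return sum(f(v) for v in stratified_uniforms(n, rng)) / n


def midpoint_rule(f, n):
    """Deterministic midpoint rule on the same n intervals."""
    return sum(f((2 * i - 1) / (2 * n)) for i in range(1, n + 1)) / n


rng = random.Random(3)
n = 1000
points = stratified_uniforms(n, rng)
est = stratified_estimate(lambda u: u * u, n, rng)   # true value is 1/3
mid = midpoint_rule(lambda u: u * u, n)
```

Both numbers should be close to 1/3 for this smooth integrand; the stratified version is random and unbiased, whereas the midpoint rule is deterministic and carries a small bias.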


Example 4.3.2 Stratifying nonuniform distributions. Let $F$ be a cumulative
distribution function on the real line and let

\[
F^{-1}(u) = \inf\{x : F(x) \ge u\}
\]

denote its inverse as defined in Section 2.2.1. Given probabilities $p_1, \ldots, p_K$
summing to 1, define $a_0 = -\infty$,

\[
a_1 = F^{-1}(p_1),\ a_2 = F^{-1}(p_1 + p_2),\ \ldots,\ a_K = F^{-1}(p_1 + \cdots + p_K) = F^{-1}(1).
\]

Define strata

\[
A_1 = (a_0, a_1],\ A_2 = (a_1, a_2],\ \ldots,\ A_K = (a_{K-1}, a_K]
\]

or with $A_K = (a_{K-1}, a_K)$ if $a_K = \infty$. By construction, each stratum $A_i$ has
probability $p_i$ under $F$; for if $Y$ has distribution $F$, then

\[
P(Y \in A_i) = F(a_i) - F(a_{i-1}) = p_i.
\]

Thus, defining strata for $F$ with specified probabilities is straightforward,
provided one can find the quantiles $a_i$. Figure 4.4 displays ten equiprobable
($p_i = 1/K$) strata for the standard normal distribution.

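To use the sets $A_1, \ldots, A_K$ for stratified sampling, we need to be able to
generate samples of $Y$ conditional on $Y \in A_i$. As demonstrated in
Example 2.2.5, this is easy using the inverse transform method. If $U \sim \mathrm{Unif}[0,1]$,
then

\[
V = F(a_{i-1}) + U\,(F(a_i) - F(a_{i-1}))
\]

is uniformly distributed between $F(a_{i-1})$ and $F(a_i)$, and then $F^{-1}(V)$ has the
distribution of $Y$ conditional on $Y \in A_i$. (The uniform must be drawn between the
cumulative probabilities $F(a_{i-1})$ and $F(a_i)$, not between the quantiles $a_{i-1}$ and
$a_i$ themselves, since $F^{-1}$ is applied to $V$.)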

Figure 4.5 illustrates the difference between stratified and random
sampling from the standard normal distribution. The left panel is a histogram of
500 observations, five from each of 100 equiprobable strata; the right panel is
a histogram of 500 independent draws from the normal distribution.
Stratification clearly produces a better approximation to the underlying distribution.

Fig. 4.4. A partition of the real line into ten intervals of equal probability under the
standard normal distribution. The area under the normal density over each interval
is 1/10.


How might we use stratified samples from the normal distribution in sim- 
ulating paths of a stochastic process? It would not be legitimate to use one 
value from each of 100 strata to generate 100 steps of a single Brownian path: 
the increments of Brownian motion are independent but the stratified values 
are not, and ignoring this dependence would produce nonsensical results. In 
contrast, we could validly use the stratified values to generate the first incre- 
ment of 100 replications of a single Brownian path (or the terminal values of 
the paths, as explained in Section 4.3.2). In short, in using stratified sampling 
or any other variance reduction technique, we are free to introduce dependence 
across replications but not within replications. □
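A stratified normal sample of the kind compared in Figure 4.5 can be sketched as follows. With $K$ equiprobable strata the cumulative probabilities at the stratum boundaries are $(i-1)/K$ and $i/K$, so a uniform on that interval is pushed through the inverse normal distribution function. The use of `statistics.NormalDist` from the Python standard library, the function name, and the guard that keeps the argument strictly inside $(0,1)$ are implementation assumptions, not part of the text.

```python
import random
from statistics import NormalDist

_EPS = 2.0 ** -53  # keeps the argument of inv_cdf strictly inside (0, 1)


def stratified_normals(K, per_stratum, rng):
    """Draw per_stratum values from each of K equiprobable strata of N(0,1)."""
    inv_cdf = NormalDist().inv_cdf
    sample = []
    for i in range(1, K + 1):
        for _ in range(per_stratum):
            # Uniform between the cumulative probabilities (i-1)/K and i/K.
            v = (i - 1 + rng.random()) / K
            v = min(max(v, _EPS), 1.0 - _EPS)
            sample.append(inv_cdf(v))
    return sample


rng = random.Random(4)
sample = stratified_normals(100, 5, rng)   # 500 values, five per stratum
```

The 500 values are deliberately dependent: exactly five fall in each stratum. They could drive, say, the first increment of 500 separate paths, but not 500 steps of one path.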


Fig. 4.5. Comparison of stratified sample (left) and random sample (right). The
stratified sample uses 100 equiprobable strata with five samples from each stratum;
the random sample consists of 500 independent draws from the normal distribution.
Both histograms use 25 bins.

Example 4.3.3 Stratification through acceptance-rejection. A crude but almost
universally applicable method for generating samples conditional on a
stratum generates unconditional samples and keeps those that fall in the target 
set. This is the method described in Example 2.2.8 for conditional sampling, 
and may be viewed as a form of acceptance-rejection in which the acceptance 
probability is always 0 or 1. 

To describe this in more detail, we use the notation of (4.33). Our goal is
to generate samples of the pair $(X, Y)$ using strata $A_1, \ldots, A_K$ for $X$, with
$n_i$ samples to be generated conditional on $X \in A_i$, $i = 1, \ldots, K$. Given
a mechanism for generating unconditional samples from the distribution of
$(X, Y)$, we can repeatedly generate such samples until we have produced $n_i$
samples with $X \in A_i$ for each $i = 1, \ldots, K$; any extra samples generated from
a stratum are simply rejected.

The efficiency of this method depends on the computational cost of
generating pairs $(X, Y)$ and determining the stratum in which $X$ falls. It also
depends on the stratum probabilities: if $P(X \in A_i)$ is small, a large number
of candidates may be required to produce $n_i$ samples from $A_i$. These
computational costs must be balanced against the reduction in variance achieved
through stratification. Glasserman, Heidelberger, and Shahabuddin [143]
analyze the overhead from rejected samples in this method based on a Poisson
approximation to the arrival of samples from each stratum. □
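The rejection scheme of this example can be sketched in a few lines of Python. The sampler `sample_xy`, the four strata for $X$, and the quotas are hypothetical choices made for illustration; the sketch also assumes every stratum with a positive quota has positive probability, since otherwise the loop would never terminate.

```python
import random


def stratify_by_rejection(sample_xy, stratum_index, quotas, rng):
    """Draw unconditional (X, Y) pairs until stratum i holds quotas[i] samples."""
    buckets = [[] for _ in quotas]
    remaining = sum(quotas)
    draws = 0
    while remaining > 0:
        x, y = sample_xy(rng)
        draws += 1
        i = stratum_index(x)
        if len(buckets[i]) < quotas[i]:
            buckets[i].append((x, y))
            remaining -= 1
        # Otherwise the stratum is already full and the extra sample is rejected.
    return buckets, draws


def sample_xy(rng):
    x = rng.gauss(0.0, 1.0)
    return x, x * x  # hypothetical output Y = X^2


def stratum_index(x):
    """Hypothetical strata for X: (-inf,-1], (-1,0], (0,1], (1,inf)."""
    if x <= -1.0:
        return 0
    if x <= 0.0:
        return 1
    if x <= 1.0:
        return 2
    return 3


rng = random.Random(5)
quotas = [5, 5, 5, 5]
buckets, draws = stratify_by_rejection(sample_xy, stratum_index, quotas, rng)
```

The count `draws` exceeds the total quota whenever samples are rejected, which is the overhead that has to be weighed against the variance reduction.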


Example 4.3.4 Stratifying the unit hypercube. The methods described in
Examples 4.3.1 and 4.3.2 extend, in principle, to multiple dimensions. Using the
inverse transform method, a vector $(X_1, \ldots, X_d)$ of independent random
variables can be represented as $(F_1^{-1}(U_1), \ldots, F_d^{-1}(U_d))$ with $F_i$ the distribution
of $X_i$ and $U_1, \ldots, U_d$ independent and uniform over $[0,1)$. In this sense, it
suffices to consider the uniform distribution over the $d$-dimensional hypercube
$[0,1)^d$. (In the case of dependent $X_1, \ldots, X_d$, replace $F_i$ with the conditional
distribution of $X_i$ given $X_1, \ldots, X_{i-1}$.) In stratifying the unit hypercube with
respect to the uniform distribution, it is convenient to take the strata to be
products of intervals because the probability of such a set is easily calculated
and because it is easy to sample uniformly from such a set by applying a
transformation like (4.35) to each coordinate.

Suppose, for example, that we stratify the $j$th coordinate of the hypercube
into $K_j$ intervals of equal length. Each stratum of the hypercube has the form

\[
\prod_{j=1}^{d}\left[\frac{i_j - 1}{K_j}, \frac{i_j}{K_j}\right), \quad i_j \in \{1, \ldots, K_j\},
\]

and has probability $1/(K_1 \cdots K_d)$. To generate a vector $V$ uniformly
distributed over this set, generate $U_1, \ldots, U_d$ independently from $\mathrm{Unif}[0,1)$ and define
the $j$th coordinate of $V$ to be

\[
V_j = \frac{i_j - 1 + U_j}{K_j}, \quad j = 1, \ldots, d. \tag{4.36}
\]

In this example, the total number of strata is $K_1 \cdots K_d$. Generating at least
one point from each stratum therefore requires a sample size at least this large.
Unless the $K_j$ are quite small (in which case stratification may provide little
benefit), this is likely to be prohibitive for $d$ larger than 5, say. In Section 4.4
and Chapter 5, we will see methods related to stratified sampling that are
better suited to higher dimensions. □
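A direct implementation of full stratification of the hypercube makes the growth in the number of strata easy to see. The Python sketch below generates one point in each cell of a product grid; the function name and the small grid sizes are assumptions for illustration.

```python
import itertools
import random


def stratified_hypercube(Ks, rng):
    """One uniform point per stratum of [0,1)^d, with K_j intervals on coordinate j.

    Returns a list of (cell_index, point) pairs; there are K_1 * ... * K_d of them.
    """
    result = []
    for cell in itertools.product(*(range(1, K + 1) for K in Ks)):
        # Coordinate j is (i_j - 1 + U_j) / K_j with an independent uniform U_j.
        point = tuple((i - 1 + rng.random()) / K for i, K in zip(cell, Ks))
        result.append((cell, point))
    return result


rng = random.Random(6)
grid = stratified_hypercube((2, 3), rng)   # d = 2, so 2 * 3 = 6 strata
```

With ten intervals per coordinate the list already has $10^d$ entries, which is the practical obstacle to this approach in higher dimensions.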


Output Analysis 


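We now turn to the problem of interval estimation for $\mu = E[Y]$ using
stratified sampling. As in (4.33), let $A_1, \ldots, A_K$ denote strata for a stratification
variable $X$ and let $Y_{ij}$ have the distribution of $Y$ conditional on $X \in A_i$. For
$i = 1, \ldots, K$, let

\[
\mu_i = E[Y_{ij}] = E[Y \mid X \in A_i] \tag{4.37}
\]
\[
\sigma_i^2 = \mathrm{Var}[Y_{ij}] = \mathrm{Var}[Y \mid X \in A_i]. \tag{4.38}
\]

Let $p_i = P(X \in A_i)$, $i = 1, \ldots, K$, denote the stratum probabilities; we
require these to be strictly positive and to sum to 1. Fix an allocation $n_1, \ldots, n_K$
with all $n_i \ge 1$ and $n_1 + \cdots + n_K = n$. Let $q_i = n_i/n$ denote the fraction of
samples allocated to the $i$th stratum. For any such allocation the estimator
$\hat{Y}$ in (4.34) is unbiased because

\[
E[\hat{Y}] = \sum_{i=1}^{K} p_i \cdot \frac{1}{n_i}\sum_{j=1}^{n_i} E[Y_{ij}] = \sum_{i=1}^{K} p_i\mu_i = \mu.
\]

The variance of $\hat{Y}$ is given by

\[
\mathrm{Var}[\hat{Y}] = \sum_{i=1}^{K} p_i^2\,\mathrm{Var}\left[\frac{1}{n_i}\sum_{j=1}^{n_i} Y_{ij}\right] = \sum_{i=1}^{K} p_i^2\frac{\sigma_i^2}{n_i} = \frac{\sigma^2(q)}{n},
\]

with

\[
\sigma^2(q) = \sum_{i=1}^{K}\frac{p_i^2}{q_i}\sigma_i^2. \tag{4.39}
\]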
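For each stratum $A_i$, the samples $Y_{i1}, Y_{i2}, \ldots$ are i.i.d. with mean $\mu_i$ and
variance $\sigma_i^2$ and thus satisfy

\[
\frac{1}{\sqrt{\lfloor nq_i \rfloor}}\sum_{j=1}^{\lfloor nq_i \rfloor}(Y_{ij} - \mu_i) \Rightarrow N(0, \sigma_i^2),
\]

as $n \to \infty$ with $q_1, \ldots, q_K$ fixed. The centered and scaled estimator $\sqrt{n}(\hat{Y} - \mu)$
can be written as

\[
\begin{aligned}
\sqrt{n}(\hat{Y} - \mu) &= \sqrt{n}\sum_{i=1}^{K} p_i\,\frac{1}{\lfloor nq_i \rfloor}\sum_{j=1}^{\lfloor nq_i \rfloor}(Y_{ij} - \mu_i) \\
&\approx \sum_{i=1}^{K}\frac{p_i}{\sqrt{q_i}}\left(\frac{1}{\sqrt{\lfloor nq_i \rfloor}}\sum_{j=1}^{\lfloor nq_i \rfloor}(Y_{ij} - \mu_i)\right),
\end{aligned}
\]

the approximation holding in the sense that the ratio of the two expressions
approaches 1 as $n \to \infty$. This shows that $\sqrt{n}(\hat{Y} - \mu)$ is asymptotically a
linear combination (with coefficients $p_i/\sqrt{q_i}$) of independent normal random
variables (with mean 0 and variances $\sigma_i^2$). It follows that

\[
\sqrt{n}(\hat{Y} - \mu) \Rightarrow N(0, \sigma^2(q))
\]

with $\sigma^2(q)$ as defined in (4.39). This limit holds as the sample size $n$ increases
with the number of strata $K$ held fixed.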

A consequence of this central limit theorem for $\hat{Y}$ is the asymptotic validity
of

\[
\hat{Y} \pm z_{\delta/2}\frac{\sigma(q)}{\sqrt{n}} \tag{4.40}
\]

as a $1 - \delta$ confidence interval for $\mu$, with $z_{\delta/2} = \Phi^{-1}(1 - \delta/2)$. In practice,
$\sigma(q)$ is typically unknown but can be consistently estimated using

\[
s^2(q) = \sum_{i=1}^{K}\frac{p_i^2}{q_i}s_i^2,
\]

where $s_i$ is the sample standard deviation of $Y_{i1}, \ldots, Y_{in_i}$.

Alternatively, one can estimate $\sigma^2(q)$ through independent replications of
$\hat{Y}$. More precisely, suppose the sample size $n$ can be expressed as $mk$ with $m$
and $k$ integers and $m \ge 2$. Suppose $k_i = q_i k$ is an integer for all $i = 1, \ldots, K$
and note that $n_i = mk_i$. Then $\hat{Y}$ is the average of $m$ independent stratified
estimators $\hat{Y}_1, \ldots, \hat{Y}_m$, each of which allocates a fraction $q_i$ of observations to
stratum $i$ and has a total sample size of $k$. Each $\hat{Y}_j$ thus has variance $\sigma^2(q)/k$;
because $\hat{Y}$ is the average of the $\hat{Y}_1, \ldots, \hat{Y}_m$, an asymptotically (as $m \to \infty$)
valid confidence interval for $\mu$ is provided by

\[
\hat{Y} \pm z_{\delta/2}\frac{\sigma(q)/\sqrt{k}}{\sqrt{m}}. \tag{4.41}
\]

This reduces to (4.40), but $\sigma(q)/\sqrt{k}$ can now be consistently estimated using
the sample standard deviation of $\hat{Y}_1, \ldots, \hat{Y}_m$. This is usually more convenient
than estimating all the stratum variances $\sigma_i^2$, $i = 1, \ldots, K$.

In this formulation, each $\hat{Y}_j$ may be thought of as a batch with sample size
$k$ and the original estimator $\hat{Y}$ as the sample mean of $m$ independent batches.
Given a total sample size $n$, is it preferable to have at least $m$ observations
from each stratum, as in this setting, or to increase the number of strata so
that only one observation is drawn from each? A larger $m$ should improve our
estimate of $\sigma(q)$ and the accuracy of the normal approximation implicit in the
confidence intervals above. However, we will see below (cf. (4.46)) that taking
finer strata reduces variance. Thus, as is often the case, we face a tradeoff
between reducing variance and accurately measuring variance.
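The batching idea can be illustrated with stratified uniforms. In the Python sketch below each batch is one stratified sample of size $k = K$ over the unit interval (so $q_i = p_i = 1/K$), and the confidence interval is built from the sample standard deviation of the $m$ batch means. The function name, the integrand, and the values of $K$ and $m$ are assumptions chosen for illustration.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def batched_stratified_ci(f, K, m, delta, rng):
    """Average of m independent stratified batches with a batch-based interval.

    Each batch draws one uniform from each of K equal strata of (0,1).
    """
    batch_means = []
    for _ in range(m):
        batch = [f((i - 1 + rng.random()) / K) for i in range(1, K + 1)]
        batch_means.append(mean(batch))
    y_hat = mean(batch_means)
    # stdev(batch_means) estimates sigma(q) / sqrt(k) without stratum variances.
    z_crit = NormalDist().inv_cdf(1.0 - delta / 2.0)
    half_width = z_crit * stdev(batch_means) / math.sqrt(m)
    return y_hat, half_width


rng = random.Random(7)
y_hat, half_width = batched_stratified_ci(math.exp, K=50, m=40, delta=0.05, rng=rng)
# The integral of exp over (0,1) is e - 1.
```

Holding $n = mK$ fixed, a larger $m$ gives a more reliable interval while a larger $K$ gives a smaller variance, which is the tradeoff described above.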


Optimal Allocation 


In the case of a proportional allocation of samples to strata, $q_i = p_i$ and the
variance parameter $\sigma^2(q)$ simplifies to

\[
\sum_{i=1}^{K}\frac{p_i^2}{q_i}\sigma_i^2 = \sum_{i=1}^{K} p_i\sigma_i^2. \tag{4.42}
\]

To compare this to the variance without stratification, observe that

\[
E[Y^2] = \sum_{i=1}^{K} p_i\,E[Y^2 \mid X \in A_i] = \sum_{i=1}^{K} p_i(\sigma_i^2 + \mu_i^2),
\]

so using $\mu = \sum_{i=1}^{K} p_i\mu_i$ we get

\[
\mathrm{Var}[Y] = E[Y^2] - \mu^2 = \sum_{i=1}^{K} p_i\sigma_i^2 + \sum_{i=1}^{K} p_i\mu_i^2 - \left(\sum_{i=1}^{K} p_i\mu_i\right)^2. \tag{4.43}
\]

By Jensen's inequality,

\[
\sum_{i=1}^{K} p_i\mu_i^2 \ge \left(\sum_{i=1}^{K} p_i\mu_i\right)^2
\]

with strict inequality unless all $\mu_i$ are equal. Thus, comparing (4.42) and
(4.43), we conclude that stratified sampling with a proportional allocation can
only decrease variance.
Optimizing the allocation can produce further variance reduction.
Minimizing $\sigma^2(q)$ subject to the constraint that $(q_1, \ldots, q_K)$ be a probability vector
yields the optimal allocation

\[
q_i^* = \frac{p_i\sigma_i}{\sum_{\ell=1}^{K} p_\ell\sigma_\ell}, \quad i = 1, \ldots, K.
\]

In other words, the optimal allocation for each stratum is proportional to the
product of the stratum probability and the stratum standard deviation. The
optimal variance is thus

\[
\sigma^2(q^*) = \sum_{i=1}^{K}\frac{p_i^2}{q_i^*}\sigma_i^2 = \left(\sum_{i=1}^{K} p_i\sigma_i\right)^2.
\]

Comparison with (4.42) indicates that the additional reduction in variance
from optimizing the allocation is greatest when the stratum standard
deviations vary widely.

In practice, the $\sigma_i$ are rarely known so the optimal fractions $q_i^*$ are not
directly applicable. Nevertheless, it is often practical to use pilot runs to get
estimates of the $\sigma_i$ and thus of the $q_i^*$. The estimated optimal fractions can
then be used to allocate samples to strata in a second (typically larger) set of
runs.
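The pilot-run procedure can be sketched as follows: estimate each stratum standard deviation from a few pilot observations, then set the fractions proportional to $p_i\hat{\sigma}_i$. The pilot data below are invented so that the arithmetic is easy to follow, and the function names are hypothetical.

```python
from statistics import stdev


def optimal_fractions(p, sigma):
    """Fractions proportional to p_i * sigma_i, normalized to sum to 1."""
    weights = [p_i * s_i for p_i, s_i in zip(p, sigma)]
    total = sum(weights)
    return [w / total for w in weights]


def variance_parameter(p, sigma, q):
    """sigma^2(q) = sum_i p_i^2 * sigma_i^2 / q_i."""
    return sum(p_i ** 2 * s_i ** 2 / q_i for p_i, s_i, q_i in zip(p, sigma, q))


# Made-up pilot output: two equiprobable strata, the second three times as noisy.
p = [0.5, 0.5]
pilot = [[0.0, 2.0], [0.0, 6.0]]
sigma_hat = [stdev(ys) for ys in pilot]        # sqrt(2) and 3 * sqrt(2)
q_star = optimal_fractions(p, sigma_hat)       # one quarter and three quarters
v_opt = variance_parameter(p, sigma_hat, q_star)
v_prop = variance_parameter(p, sigma_hat, p)   # proportional allocation, q = p
```

Here the estimated variance parameter drops from 10 under proportional allocation to 8 under the estimated optimal allocation. In a real application the pilot estimates are themselves noisy, so the realized gain may be smaller.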

In taking the optimal allocation to be the one that minimizes variance,
we are implicitly assuming that the computational effort required to generate
samples is the same across strata. But this assumption is not always
appropriate. For example, in sampling from strata through acceptance-rejection as
described in Example 4.3.3, the expected time required to sample from $A_i$ is
proportional to $1/p_i$. A more complete analysis should therefore account for
differences in computational costs across strata.

Suppose, then, that $\tau_i$ denotes the expected computing time required to
sample $(X, Y)$ conditional on $X \in A_i$ and let $s$ denote the total computing
budget. Let $\hat{Y}(s)$ denote the stratified estimator produced with a budget $s$,
assuming the fraction of samples allocated to stratum $i$ is $q_i$. (This is
asymptotically equivalent to assuming the fraction of the computational budget
allocated to stratum $i$ is proportional to $q_i\tau_i$.) Arguing much as in Section 1.1.3,
we find that

\[
\sqrt{s}\,[\hat{Y}(s) - \mu] \Rightarrow N(0, \sigma^2(q, \tau))
\]

with

\[
\sigma^2(q, \tau) = \left(\sum_{i=1}^{K}\frac{p_i^2\sigma_i^2}{q_i}\right)\left(\sum_{i=1}^{K} q_i\tau_i\right).
\]

By minimizing this work-normalized variance parameter we find that the
optimal allocation is

\[
q_i^* = \frac{p_i\sigma_i/\sqrt{\tau_i}}{\sum_{\ell=1}^{K} p_\ell\sigma_\ell/\sqrt{\tau_\ell}},
\]

which now accounts for differences in computational costs across strata. Like
the $\sigma_i$, the $\tau_i$ can be estimated through pilot runs.


Variance Decomposition 


The preceding discussion considers the allocation of samples to given strata.
In order to consider the question of how strata should be selected in the first
place, we now examine what part of the variance of $Y$ is removed through
stratification of $X$.

As before, let $A_1, \ldots, A_K$ be strata for $X$. Let $\eta \equiv \eta(X) \in \{1, \ldots, K\}$
denote the index of the stratum containing $X$, so that $X \in A_\eta$. We can
always write

\[
Y = E[Y \mid \eta] + \epsilon \tag{4.44}
\]

simply by defining the residual $\epsilon$ so that equality holds. It is immediate that
$E[\epsilon \mid \eta] = 0$ and that $\epsilon$ is uncorrelated with $E[Y \mid \eta]$ because

\[
E[\epsilon\,(E[Y \mid \eta] - E[Y])] = 0,
\]

as can be seen by first conditioning on $\eta$. Because (4.44) decomposes $Y$ into
the sum of uncorrelated terms, we have

\[
\mathrm{Var}[Y] = \mathrm{Var}[E[Y \mid \eta]] + \mathrm{Var}[\epsilon].
\]

We will see that stratified sampling with proportional allocation eliminates
the first term on the right, leaving only the variance of the residual term and
thus guaranteeing a variance reduction.

The residual variance is $E[\epsilon^2]$ because $E[\epsilon] = 0$. Also,

\[
E[\epsilon^2 \mid \eta] = E\left[(Y - E[Y \mid \eta])^2 \mid \eta\right] = \mathrm{Var}[Y \mid \eta].
\]

We thus arrive at the familiar decomposition

\[
\mathrm{Var}[Y] = \mathrm{Var}[E[Y \mid \eta]] + E[\mathrm{Var}[Y \mid \eta]]. \tag{4.45}
\]

The conditional expectation of $Y$ given $\eta = i$ is $\mu_i$, and the probability that
$\eta = i$ is $p_i$. The first term on the right side of (4.45) is thus

\[
\mathrm{Var}[E[Y \mid \eta]] = \sum_{i=1}^{K} p_i\mu_i^2 - \left(\sum_{i=1}^{K} p_i\mu_i\right)^2.
\]

Comparing this with (4.43), we conclude from (4.44) and (4.45) that

\[
\mathrm{Var}[\epsilon] = E[\mathrm{Var}[Y \mid \eta]] = \sum_{i=1}^{K} p_i\sigma_i^2,
\]

which is precisely the variance parameter in (4.42) for stratified sampling
with proportional allocation. This confirms that the variance parameter of
the stratified estimator is the variance of the residual of $Y$ after conditioning
on $\eta$.

Consider now the effect of alternative choices of strata. The total variance
$\mathrm{Var}[Y]$ in (4.45) is constant, so making the residual variance small is equivalent
to making $\mathrm{Var}[E[Y \mid \eta]]$ large, i.e., to making $\mathrm{Var}[\mu_\eta]$ large. This indicates that
we should try to choose strata to achieve a high degree of variability across the
stratum means $\mu_1, \ldots, \mu_K$ and low variability within each stratum. Indeed,
from (4.45) we find that stratification eliminates inter-stratum variability,
leaving only intra-stratum variability.

Another consequence of (4.45) is that further stratification results in
further variance reduction. More precisely, suppose the partition $\{\tilde{A}_1, \ldots, \tilde{A}_{\tilde{K}}\}$
refines the partition $\{A_1, \ldots, A_K\}$, in the sense that the stratum index $\tilde{\eta}$ of
the new partition completely determines $\eta$. Then $E[Y \mid \eta] = E[E[Y \mid \tilde{\eta}] \mid \eta]$ and
Jensen's inequality yields

\[
\mathrm{Var}[E[Y \mid \eta]] \le \mathrm{Var}[E[Y \mid \tilde{\eta}]], \tag{4.46}
\]

from which it follows that the residual variance from the refined strata cannot
exceed the residual variance from the original strata.

The decomposition (4.44) invites a comparison between stratified sampling
and control variates. Consider the case of real-valued $X$. Using the method of
Example 4.3.2, we can in principle stratify $X$ using an arbitrarily large number
of equiprobable intervals. As we refine the stratification, it is reasonable to
expect that $E[Y \mid \eta]$ will approach $E[Y \mid X]$. (For a specific result of this type
see Lemma 4.1 of Glasserman, Heidelberger, and Shahabuddin [139].) The
decomposition (4.44) becomes

\[
Y = E[Y \mid X] + \epsilon = g(X) + \epsilon,
\]

with $g(x) = E[Y \mid X = x]$. If $g$ is linear, then the variance removed through
(infinitely fine) stratification of $X$ is precisely the same as the variance that
would be removed using $X$ as a control variate. But in the general case, using
$X$ as a control variate would remove only the variance associated with the
linear part of $g$ near $E[X]$; see the discussion in Section 4.1.4. In contrast,
infinitely fine stratification of $X$ removes all the variance of $g(X)$ leaving only
the variance of the residual $\epsilon$. In this sense, using $X$ as a stratification variable
is more effective than using it as a control variate. However, it should also be
noted that using $X$ as a control requires knowledge only of $E[X]$ and not the
full distribution of $X$; moreover, it is often easier to use $X$ as a control than
to generate samples from the conditional law of $(X, Y)$ given the stratum
containing $X$.


4.3.2 Applications 


This section illustrates the application of stratified sampling in settings some- 
what more complex than those in Examples 4.3.1—4.3.4. As noted in Exam- 
ple 4.3.4, fully stratifying a random vector becomes infeasible in high dimen- 
sions. We therefore focus primarily on methods that stratify a scalar projection 
in a multidimensional setting. This can be effective in valuing a derivative se- 
curity if its discounted payoff is highly variable along the selected projection. 


Terminal Stratification 


In the pricing of options, the most important feature of the path of an
underlying asset is often its value at the option expiration: much of the variability
in the option's payoff can potentially be eliminated by stratifying the
terminal value. As a step in this direction, we detail the stratification of Brownian
motion along its terminal value. In the special case of an asset described
by geometric Brownian motion with constant volatility, this is equivalent to
stratifying the terminal value of the asset price itself.

Suppose, then, that we need to generate a discrete Brownian path $W(t_1),
\ldots, W(t_m)$ and that we want to stratify the terminal value $W(t_m)$. We can
accomplish this through a variant of the Brownian bridge construction of
Brownian motion presented in Section 3.1. Using the inverse transform method
as in Example 4.3.2 we can stratify $W(t_m)$, and then conditional on each value
of $W(t_m)$ we can generate the intermediate values $W(t_1), \ldots, W(t_{m-1})$.
Consider, in particular, the case of $K$ equiprobable strata and a
proportional allocation. Let $U_1, \ldots, U_K$ be independent $\mathrm{Unif}[0,1]$ random variables
and set

\[
V_i = \frac{i-1}{K} + \frac{U_i}{K}, \quad i = 1, \ldots, K.
\]

Then $\Phi^{-1}(V_1), \ldots, \Phi^{-1}(V_K)$ form a stratified sample from the standard
normal distribution and $\sqrt{t_m}\,\Phi^{-1}(V_1), \ldots, \sqrt{t_m}\,\Phi^{-1}(V_K)$ form a stratified sample
from $N(0, t_m)$, the distribution of $W(t_m)$. To fill in the path leading to each
$W(t_m)$, we recall from Section 3.1 that the conditional distribution of $W(t_j)$
given $W(t_{j-1})$ and $W(t_m)$ is

\[
N\left(\frac{t_m - t_j}{t_m - t_{j-1}}W(t_{j-1}) + \frac{t_j - t_{j-1}}{t_m - t_{j-1}}W(t_m),\ \frac{(t_m - t_j)(t_j - t_{j-1})}{t_m - t_{j-1}}\right),
\]

with $t_0 = 0$ and $W(0) = 0$.
The following algorithm implements this idea to generate $K$ Brownian
paths stratified along $W(t_m)$:

for i = 1, ..., K
    generate U ~ Unif[0,1]
    V <- (i - 1 + U)/K
    W(t_m) <- sqrt(t_m) Phi^{-1}(V)
    for j = 1, ..., m - 1
        generate Z ~ N(0,1)
        W(t_j) <- [(t_m - t_j)/(t_m - t_{j-1})] W(t_{j-1})
                  + [(t_j - t_{j-1})/(t_m - t_{j-1})] W(t_m)
                  + sqrt((t_m - t_j)(t_j - t_{j-1})/(t_m - t_{j-1})) Z


Figure 4.6 illustrates the construction. Of the K = 10 paths generated by this 
algorithm, exactly one terminates in each of the K = 10 strata defined for 
W (tm), here with tm = 1. 
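For readers who want to experiment, the algorithm for generating paths stratified along the terminal value can be transcribed into Python along the following lines. This is a sketch under stated assumptions: `statistics.NormalDist().inv_cdf` stands in for $\Phi^{-1}$, the argument is kept strictly inside $(0,1)$ to avoid infinite values, and the function name and time grid are illustrative.

```python
import math
import random
from statistics import NormalDist

_EPS = 2.0 ** -53  # keeps the argument of inv_cdf strictly inside (0, 1)


def terminally_stratified_paths(times, K, rng):
    """Return K Brownian paths on `times` (increasing, positive), one per stratum
    of the terminal value W(t_m); each path is [W(t_1), ..., W(t_m)]."""
    inv_cdf = NormalDist().inv_cdf
    t_m = times[-1]
    paths = []
    for i in range(1, K + 1):
        v = min(max((i - 1 + rng.random()) / K, _EPS), 1.0 - _EPS)
        w_end = math.sqrt(t_m) * inv_cdf(v)        # stratified terminal value
        path = []
        t_prev, w_prev = 0.0, 0.0                  # t_0 = 0 and W(0) = 0
        for t_j in times[:-1]:
            span = t_m - t_prev
            # Brownian bridge step: W(t_j) given W(t_{j-1}) and W(t_m).
            cond_mean = ((t_m - t_j) * w_prev + (t_j - t_prev) * w_end) / span
            cond_sd = math.sqrt((t_m - t_j) * (t_j - t_prev) / span)
            w_prev = cond_mean + cond_sd * rng.gauss(0.0, 1.0)
            t_prev = t_j
            path.append(w_prev)
        path.append(w_end)
        paths.append(path)
    return paths


rng = random.Random(8)
times = [j / 10 for j in range(1, 11)]             # t_m = 1
paths = terminally_stratified_paths(times, 10, rng)
```

By construction the terminal values are increasing in the stratum index, with exactly one path ending in each of the ten equiprobable intervals for $W(1)$.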

If the underlying asset price $S(t)$ is modeled by geometric Brownian
motion, then driving the simulation of $S$ with these Brownian paths stratifies the
terminal asset price $S(t_m)$; this is a consequence of the fact that $S(t_m)$ is a
monotone transformation of $W(t_m)$. In valuing an option on $S$, rather than
constructing equiprobable strata over all possible terminal values, we may
combine all values of $S(t_m)$ that result in zero payoff into a single stratum
and create a finer stratification of terminal values that potentially produce a
nonzero payoff. (The payoff will not be completely determined by $S(t_m)$ if it
is path-dependent.)

Fig. 4.6. Simulation of K Brownian paths using terminal stratification. One path
reaches each of the K strata. The strata are equiprobable under the distribution of
W(1).
As an example of how a similar construction can be used in a more complex
setting, consider the dynamics of a forward LIBOR rate $L_n$ as in (4.6).
Consider a single-factor model (so that W is a scalar Brownian motion) with
deterministic but time-varying volatility $\sigma_n \equiv \sigma$. Without the
drift term in (4.6), the terminal value $L_n(t_m)$ would be determined by

$$\int_0^{t_m} \sigma(u)\, dW(u)$$

rather than $W(t_m)$, so we may prefer to stratify this integral instead. If
$\sigma$ is constant over each interval $[t_i, t_{i+1})$, this integral
simplifies to

$$W(t_m)\sigma(t_{m-1}) + \sum_{i=1}^{m-1} W(t_i)\,[\sigma(t_{i-1}) - \sigma(t_i)]. \tag{4.47}$$

Similarly, for some path-dependent options one may want to stratify the average

$$\frac{1}{m} \sum_{i=1}^{m} W(t_i). \tag{4.48}$$

In both cases, the stratification variable is a linear combination of the
$W(t_i)$ and is thus a special case of the general problem treated next.

Stratifying a Linear Projection


Generating $(W(t_1), \ldots, W(t_m))$ stratified along $W(t_m)$ or (4.47) or (4.48) are
all special cases of the problem of generating a multivariate normal random 
vector stratified along some projection. We now turn to this general formula- 
tion of the problem. 

Suppose, then, that $\xi \sim N(\mu, \Sigma)$ in $\mathbb{R}^d$ and that we want
to generate $\xi$ with $X = v^\top \xi$ stratified for some fixed vector
$v \in \mathbb{R}^d$. Suppose the $d \times d$ matrix $\Sigma$ has full rank. We
may take $\mu$ to be the zero vector because stratifying $v^\top \xi$ is
equivalent to stratifying $v^\top (\xi - \mu)$ since $v^\top \mu$ is a constant.
Also, stratifying X is equivalent to stratifying any multiple of X; by scaling
v if necessary, we may therefore assume that $v^\top \Sigma v = 1$. Thus,

$$X = v^\top \xi \sim N(0, v^\top \Sigma v) = N(0, 1),$$

so we know how to stratify X using the method in Example 4.3.1.
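The next step is to generate $\xi$ conditional on the value of X. First observe
that $\xi$ and X are jointly normal with

$$\begin{pmatrix} \xi \\ X \end{pmatrix} \sim N\left( 0, \begin{pmatrix} \Sigma & \Sigma v \\ v^\top \Sigma & v^\top \Sigma v \end{pmatrix} \right).$$

Using the Conditioning Formula (2.25), we find that

$$(\xi \mid X = x) \sim N\left( \frac{\Sigma v}{v^\top \Sigma v}\, x,\ \Sigma - \frac{\Sigma v v^\top \Sigma}{v^\top \Sigma v} \right) = N\left( \Sigma v x,\ \Sigma - \Sigma v v^\top \Sigma \right).$$

Observe that the conditional covariance matrix does not depend on x; this is
important because it means that only a single factorization is required for the
conditional sampling. Let A be any matrix for which $AA^\top = \Sigma$ (such as
the one found by Cholesky factorization) and observe that

$$\begin{aligned}
(A - \Sigma v v^\top A)(A - \Sigma v v^\top A)^\top
&= AA^\top - AA^\top v v^\top \Sigma - \Sigma v v^\top AA^\top + \Sigma v v^\top \Sigma v v^\top \Sigma \\
&= \Sigma - \Sigma v v^\top \Sigma,
\end{aligned}$$

again using the fact that $v^\top \Sigma v = 1$. Thus, we can use the matrix
$A - \Sigma v v^\top A$ to sample from the conditional distribution of $\xi$
given X.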

The following algorithm generates K samples from $N(0, \Sigma)$ stratified
along the direction determined by v:

for i = 1, ..., K
    generate U ~ Unif[0,1]
    V ← (i − 1 + U)/K
    X ← Φ^{-1}(V)
    generate Z ~ N(0, I) in R^d
    ξ ← ΣvX + (A − Σvv^T A)Z

By construction, of the K values of X generated by this algorithm, exactly one
will fall in each of K equiprobable strata for the standard normal distribution.
But observe that under this construction,

$$v^\top \xi = v^\top \Sigma v X + v^\top (A - \Sigma v v^\top A) Z = X.$$

Thus, of the K values of $\xi$ generated, exactly one has a projection
$v^\top \xi$ falling into each of K equiprobable strata. In this sense, the
algorithm generates samples from $N(0, \Sigma)$ stratified along the direction
determined by v.
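As an illustration of the construction for a general covariance matrix, the following Python sketch uses plain lists and a hand-written Cholesky factorization so that it stays self-contained. The helper names (`cholesky`, `matvec`, `dot`, `stratified_projection_sample`) are hypothetical, and the rescaling of v inside the function is one possible way to enforce the normalization assumed in the text.

```python
import math
import random
from statistics import NormalDist


def cholesky(S):
    """Lower-triangular A with A A^T = S, for symmetric positive definite S."""
    d = len(S)
    A = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = S[i][j] - sum(A[i][k] * A[j][k] for k in range(j))
            A[i][j] = math.sqrt(s) if i == j else s / A[j][j]
    return A


def matvec(M, x):
    return [sum(a * b for a, b in zip(row, x)) for row in M]


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def stratified_projection_sample(S, v, K, rng):
    """K draws (X, xi) with xi ~ N(0, S) and X = v^T xi stratified."""
    Sv = matvec(S, v)
    scale = math.sqrt(dot(v, Sv))
    v = [x / scale for x in v]            # rescale so that v^T S v = 1
    Sv = [x / scale for x in Sv]          # S v for the rescaled v
    A = cholesky(S)                       # single factorization, reused below
    inv_cdf = NormalDist().inv_cdf
    out = []
    for i in range(1, K + 1):
        V = min(max((i - 1 + rng.random()) / K, 1e-12), 1.0 - 1e-12)
        X = inv_cdf(V)
        Z = [rng.gauss(0.0, 1.0) for _ in range(len(S))]
        AZ = matvec(A, Z)
        c = dot(v, AZ)                    # v^T A Z
        # xi = S v X + (A - S v v^T A) Z = S v X + A Z - (S v)(v^T A Z)
        xi = [Sv[k] * X + AZ[k] - Sv[k] * c for k in range(len(S))]
        out.append((X, xi))
    return v, out
```

Taking `S[i][j] = min(t_i, t_j)` and a vector v of equal entries corresponds to stratifying the average of a Brownian path on the grid t_1, ..., t_m.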

To apply this method to generate a Brownian path with the integral in
(4.47) stratified, take $\Sigma$ to be the covariance matrix of the Brownian
path ($\Sigma_{ij} = \min(t_i, t_j)$, as in (3.6)) and

$$v \propto (\sigma(0) - \sigma(1),\ \sigma(1) - \sigma(2),\ \ldots,\ \sigma(m-2) - \sigma(m-1),\ \sigma(m-1))^\top,$$

normalized so that $v^\top \Sigma v = 1$. To generate the path with its average
(4.48) stratified, take v to be the vector with all entries equal to the
reciprocal of the square root of the sum of the entries of $\Sigma$, which
gives $v^\top \Sigma v = 1$. Yet another strategy for choosing stratification
directions is to use the principal components of $\Sigma$ (cf. Section 2.3.3).

Further simplification is possible in stratifying a sample from the standard
multivariate normal distribution N(0, I). In this case, the construction above
becomes

$$\xi = vX + (I - vv^\top) Z, \quad X \sim N(0, 1), \quad Z \sim N(0, I),$$

with v now normalized so that $v^\top v = 1$. Since $X = v^\top \xi$, by
stratifying X we stratify the projection of $\xi$ onto v. The special feature
of this setting is that the matrix-vector product $(I - vv^\top) Z$ can be
evaluated as $Z - v(v^\top Z)$, which requires O(d) operations rather than
$O(d^2)$.
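The O(d) evaluation for the standard normal case might be coded as follows. This is a sketch only: the function name `stratified_along_direction` is hypothetical, the direction is normalized inside the function, and the matrix I − vv^T is never formed.

```python
import math
import random
from statistics import NormalDist


def stratified_along_direction(v, K, rng):
    """K draws from N(0, I) in R^d with the projection onto v stratified."""
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]                     # enforce v^T v = 1
    inv_cdf = NormalDist().inv_cdf
    draws = []
    for i in range(1, K + 1):
        V = min(max((i - 1 + rng.random()) / K, 1e-12), 1.0 - 1e-12)
        X = inv_cdf(V)                            # stratified projection
        Z = [rng.gauss(0.0, 1.0) for _ in range(len(v))]
        vZ = sum(a * b for a, b in zip(v, Z))     # v^T Z in O(d) operations
        # xi = v X + Z - v (v^T Z)
        draws.append([z + a * (X - vZ) for a, z in zip(v, Z)])
    return v, draws
```

Passing a vector of volatilities as `v` gives the kind of stratification direction used in the LIBOR market model illustration in this section.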

This construction extends easily to allow stratification along multiple
directions simultaneously. Let B denote a $d \times m$ matrix, $m \le d$, whose
columns represent the stratification directions. Suppose B has been normalized
so that $B^\top \Sigma B = I$. If $\Sigma$ itself is the identity matrix, this
says that the m columns of B form a set of orthonormal vectors in
$\mathbb{R}^d$. Where we previously stratified the scalar projection
$X = v^\top \xi$, we now stratify the m-vector $X = B^\top \xi$, noting that
X ~ N(0, I). For this, we first stratify the m-dimensional hypercube as in
Example 4.3.4 and then set $X_j = \Phi^{-1}(V_j)$, $j = 1, \ldots, m$, with
$(V_1, \ldots, V_m)$ sampled from a stratum of the hypercube as in (4.36). This
samples X from the m-dimensional standard normal distribution with each of its
components stratified. We then set

$$\xi = \Sigma B X + (A - \Sigma B B^\top A) Z, \quad Z \sim N(0, I),$$

for any $d \times d$ matrix A satisfying $AA^\top = \Sigma$. The projection
$B^\top \xi$ of $\xi$ onto the columns of B returns X and is stratified by
construction.

To illustrate this method, we apply it to the LIBOR market model discussed in
Section 3.7, using the notation and terminology of that section. We use accrual
intervals of length $\delta = 1/2$ and set all forward rates initially equal
to 6%. We consider a single-factor model (i.e., one driven by a scalar Brownian
motion) with a piecewise constant stationary volatility, meaning that
$\sigma_n(t)$ depends on n and t only through $n - \eta(t)$, the number of
maturity dates remaining until $T_n$. We consider a model in which volatility
decreases linearly from 0.20 to 0.10 over a 20-year horizon, and a model in
which all forward rate volatilities are 0.15.

Our simulation uses a time increment equal to $\delta$, throughout which
volatilities are constant. We therefore write

$$\int_0^{T_n} \sigma_n(t)\, dW(t) = \sqrt{\delta} \sum_{i=1}^{n} \sigma_n(T_{i-1}) Z_i \tag{4.49}$$

with $Z_1, Z_2, \ldots$ independent N(0,1) variables. This suggests using the
vector $(\sigma_n(0), \sigma_n(T_1), \ldots, \sigma_n(T_{n-1}))$ as the
stratification direction in sampling $(Z_1, \ldots, Z_n)$ from the standard
normal distribution in $\mathbb{R}^n$.

Table 4.2 reports estimated variance reduction ratios for pricing various 
options in this model. Each entry in the table gives an estimate of the ratio 
of the variance using ordinary Monte Carlo to the variance using a stratified 
sample of equal size. The results are based on 40 strata (or simply 40
independent samples for ordinary Monte Carlo); the estimated ratios are based on
1000 replications, each replication using a sample size of 40. The 1000 replications
merely serve to make the ratio estimates reliable; the ratios themselves should 
be interpreted as the variance reduction achieved by using 40 strata. 

The results shown are for a caplet with a maturity of 20 years, a caplet with 
a maturity of 5 years, bonds with maturities 20.5 and 5.5 years, and a swaption 
maturing in 5 years to enter into a 5-year, fixed-for-floating interest rate swap. 
The options are all at-the-money. The results are based on simulation in the 
spot measure using the log-Euler scheme in (3.120), except for the last row 
which applies to the forward measure for maturity 20.5. In each case, the 
stratification direction is based on the relevant portion of the volatility vector 
— forty components for a 20-year simulation, ten components for a 5-year 
simulation. (Discounting a payment to be received at $T_{n+1}$ requires simulating
to $T_n$.)

The results in the table indicate that the variance reduction achieved varies 
widely but can be quite substantial. With notable exceptions, we generally see 
greater variance reduction at shorter maturities, at least in part because of 
the discount factor (see, e.g., (3.109)). The stratification direction we use is 
tailored to a particular rate $L_n$ (through (4.49)), but not necessarily to the
discount factor. The discount factor becomes a constant under the forward 
measure and, accordingly, we see a greater variance reduction in this case, 
at the same maturity. Surprisingly, we find the greatest improvement in the
case of the swaption, even though the stratification direction is not
specifically tailored to the swap rate.

                        Linear        Constant
                        volatility    volatility
Spot Measure
  Caplet, T = 20             2             8
  Caplet, T = 5             26            50
  Swaption, T = 5           38            79
  Bond, T = 20.5            12             4
  Bond, T = 5.5              5             4
Forward Measure
  Caplet, T = 20            11            11

Table 4.2. Variance reduction factors using one-dimensional stratified sampling
in a single-factor LIBOR market model. The stratification direction is determined
by the vector of volatilities. The results are based on 1000 replications of samples
(stratified or independent) of size 40. For each instrument, the value of T indicates
the maturity.


Optimal Directions 


In estimating $E[f(\xi)]$ with $\xi \sim N(\mu, \Sigma)$ and f a function from
$\mathbb{R}^d$ to $\mathbb{R}$, it would be convenient to know the
stratification direction v for which stratifying $v^\top \xi$ would produce the
greatest reduction in variance. Finding this optimal v is rarely possible; we
give a few examples for which the optimal direction is available explicitly.

With no essential loss of generality, we restrict attention to the case of
E[f(Z)] with Z ~ N(0, I). From the variance decomposition (4.45) and the
surrounding discussion, we know that the residual variance after stratifying
a linear combination $v^\top Z$ is $E[\mathrm{Var}[f(Z) \mid \eta]]$, where
$\eta$ is the (random) index of the stratum containing $v^\top Z$. If we use
equiprobable strata and let the number of strata grow (with each new set of
strata refining the previous set), this residual variance converges to
$E[\mathrm{Var}[f(Z) \mid v^\top Z]]$ (cf. Lemma 4.1 of [139]). We will
therefore compare alternative choices of v through this limiting value.

In the linear case $f(z) = b^\top z$, it is evident that the optimal direction
is v = b. Next, let $f(z) = z^\top A z$ for some $d \times d$ matrix A. We may
assume that A is symmetric and thus that it has real eigenvalues
$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d$ and associated orthonormal
eigenvectors $v_1, \ldots, v_d$. Minimizing $E[\mathrm{Var}[f(Z) \mid v^\top Z]]$
over vectors v for which $v^\top v = 1$ is equivalent to maximizing
$\mathrm{Var}[E[f(Z) \mid v^\top Z]]$ over the same set because the two terms
sum to Var[f(Z)] for any v. In the quadratic case, some matrix algebra shows
that $v^\top v = 1$ implies

$$\mathrm{Var}[E[Z^\top A Z \mid v^\top Z]] = 2 (v^\top A v)^2,$$

where the factor of 2 is the variance of the square of a standard normal random
variable. This is maximized by $v_1$ if $\lambda_1^2 \ge \lambda_d^2$ and by
$v_d$ if $\lambda_d^2 \ge \lambda_1^2$. In other words, the optimal
stratification direction is an eigenvector of A associated with an eigenvalue of
largest absolute value. The effect of optimal stratification is to reduce
variance from $2\sum_i \lambda_i^2$ to
$2\left(\sum_i \lambda_i^2 - \max_i \lambda_i^2\right)$.

As a final case, let $f(z) = \exp(\tfrac{1}{2} z^\top A z)$. For f(Z) to have
finite second moment, we now require that $\lambda_1 < 1/2$. Theorem 4.1 of
[139] shows that the optimal stratification direction in this case is an
eigenvector $v_{j^*}$ where $j^*$ satisfies

$$\left( \frac{\lambda_{j^*}}{1 - \lambda_{j^*}} \right)^2 = \max_{i=1,\ldots,d} \left( \frac{\lambda_i}{1 - \lambda_i} \right)^2. \tag{4.50}$$

As in the previous case, this criterion will always select either $\lambda_1$
or $\lambda_d$, but it will not necessarily select the one with largest
absolute value.
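The two selection rules can be compared on a small example. The sketch below assumes the eigenvalues of A have already been computed; the function names are hypothetical. For the eigenvalues 0.45, 0.1, and −2, the quadratic rule picks −2, whereas the exponential-quadratic rule picks 0.45, because (0.45/0.55)² exceeds (2/3)².

```python
def quadratic_direction(eigenvalues):
    """Index of the eigenvalue of largest absolute value (case f(z) = z'Az)."""
    return max(range(len(eigenvalues)), key=lambda j: eigenvalues[j] ** 2)


def exp_quadratic_direction(eigenvalues):
    """Index j maximizing (lambda_j / (1 - lambda_j))**2 (case exp(z'Az/2))."""
    if max(eigenvalues) >= 0.5:
        raise ValueError("finite second moment requires all eigenvalues < 1/2")
    return max(range(len(eigenvalues)),
               key=lambda j: (eigenvalues[j] / (1.0 - eigenvalues[j])) ** 2)


lams = [0.45, 0.1, -2.0]
best_quadratic = quadratic_direction(lams)          # index of -2.0
best_exp_quadratic = exp_quadratic_direction(lams)  # index of 0.45
```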

Simulation is unnecessary for evaluation of E[f(Z)] in each of these exam- 
ples. Nevertheless, a linear, quadratic, or exponential-quadratic function may 
be useful as an approximation to a more general f and thus as a guide in 
selecting stratification directions. Fox [127] uses quadratic approximations for 
related purposes in implementing quasi-Monte Carlo methods. Glasserman, 
Heidelberger, and Shahabuddin [139] use an exponential-quadratic approxi- 
mation for stratified sampling in option pricing; in their application, A is the 
Hessian of the logarithm of an option’s discounted payoff. We discuss this 
method in Section 4.6.2. 


Radial Stratification 


The symmetry of the standard multivariate normal distribution makes it
possible to draw samples from this distribution with stratified norm. For
$Z \sim N(0, I)$ in $\mathbb{R}^d$, let

$$X = Z_1^2 + \cdots + Z_d^2,$$

so that $\sqrt{X} = \|Z\|$ is the radius of the sphere on which Z falls. The
distribution of X is chi-square with d degrees of freedom (abbreviated
$\chi^2_d$) and is given explicitly in (3.71). Section 3.4.2 discusses efficient
methods for sampling from $\chi^2_d$, but for stratification it is more
convenient to use the inverse transform method as explained in Example 4.3.2.
There is no closed-form expression for the inverse of the $\chi^2_d$
distribution, but the inverse can be evaluated numerically and methods for
doing this are available in many statistical software libraries (see, e.g., the
survey in Section 18.5 of [201]). Hence, by generating a stratified sample from
Unif[0,1] and then applying the inverse of the $\chi^2_d$ distribution, we can
generate stratified values of X.

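The next step is to sample Z conditional on the value of the stratification
variable X. Because of the symmetry of the normal distribution, given X the
vector Z is uniformly distributed on the sphere of radius $\sqrt{X}$. This is
the basis of the Box-Muller method (cf. Section 2.3) in dimension 2 but it holds
for all dimensions d. To sample from the sphere of radius $R = \sqrt{X}$ in
$\mathbb{R}^d$, we can extend the Box-Muller construction as follows: sample
$U_1, \ldots, U_{d-1}$ independently from Unif[0,1] and set

$$\begin{aligned}
Z_1 &= R \cos(2\pi U_1) \\
Z_2 &= R \sin(2\pi U_1) \cos(2\pi U_2) \\
&\ \ \vdots \\
Z_{d-1} &= R \sin(2\pi U_1) \sin(2\pi U_2) \cdots \sin(2\pi U_{d-2}) \cos(2\pi U_{d-1}) \\
Z_d &= R \sin(2\pi U_1) \sin(2\pi U_2) \cdots \sin(2\pi U_{d-2}) \sin(2\pi U_{d-1}).
\end{aligned}$$

For d = 2 this point is uniformly distributed over the circle of radius R. For
$d \ge 3$ the point still lies on the sphere of radius R, but independent
uniform angles do not make it uniformly distributed over the sphere; uniformity
requires drawing all but the last angle from suitable nonuniform distributions.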


Alternatively, given a method for generating standard normal random variables
we can avoid the evaluation of sines and cosines. If $\xi_1, \ldots, \xi_d$ are
independent N(0,1) random variables and $\xi = (\xi_1, \ldots, \xi_d)^\top$,
then $\xi/\|\xi\|$ is uniformly distributed over the unit sphere and

$$Z = R\, \frac{\xi}{\|\xi\|}$$

is uniformly distributed over the sphere of radius R.
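A minimal Python sketch of radial stratification using the normalization method is given below. Because the standard library offers no inverse chi-square distribution, the inverse is passed in as a function; the closed form supplied here, `chi2_inv_2`, is valid only for d = 2, where the chi-square distribution is exponential with mean 2. All function names are assumptions made for this example.

```python
import math
import random


def chi2_inv_2(v):
    """Inverse chi-square distribution function for d = 2 only (0 <= v < 1)."""
    return -2.0 * math.log(1.0 - v)


def radial_stratified_normals(d, K, chi2_inv, rng):
    """K draws from N(0, I) in R^d with the squared norm stratified."""
    draws = []
    for i in range(1, K + 1):
        V = (i - 1 + rng.random()) / K       # uniform on the i-th stratum
        R = math.sqrt(chi2_inv(V))           # stratified radius
        xi = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(x * x for x in xi))
        draws.append([R * x / norm for x in xi])   # uniform on sphere of radius R
    return draws
```

For other dimensions, `chi2_inv` would have to be replaced by a numerical inverse from a statistical library.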

It should be noted that neither of these constructions extends easily to
stratified sampling from $N(0, \Sigma)$ for general $\Sigma$. If
$\zeta \sim N(0, \Sigma)$ and $X = \zeta^\top \Sigma^{-1} \zeta$, then
$X \sim \chi^2_d$ and we can stratify X just as before; moreover, given X,
$\zeta$ is uniformly distributed over the ellipsoid

$$H_X = \{ x \in \mathbb{R}^d : x^\top \Sigma^{-1} x = X \}.$$

The difficulty lies in sampling uniformly from the ellipsoid. Extending the
Box-Muller construction entails replacing the sines and cosines with elliptic
functions. The second construction does not appear to generalize at all: if
$\xi \sim N(0, \Sigma)$, the vector
$\sqrt{X}\,\xi / \sqrt{\xi^\top \Sigma^{-1} \xi}$ lies on the ellipsoid $H_X$
but is not uniformly distributed over the ellipsoid.
The construction does, however, generalize beyond the standard normal to the
class of spherically contoured distributions. The random vector Y is said to
have a spherically contoured distribution if its conditional distribution given
$\|Y\|$ is uniform over the sphere of radius $\|Y\|$; see Fang, Kotz, and Ng
[114]. To stratify Y along its radius, we must therefore stratify $X = \|Y\|$,
which will not be $\chi^2$ except in the normal case. Given X, we can sample Y
uniformly from the sphere of radius X using either of the methods described
above for the normal distribution.

Radial stratification is proposed and applied in [142] as a method for re- 
ducing variance in estimating the risk in a portfolio of options for which losses 
result from large moves of the underlying assets in any direction. 


Stratifying a Poisson Process 


In this example, we generate a Poisson process on [0, T] with the total number
of jumps in this interval stratified. Let $\lambda$ denote the arrival rate for
the Poisson process and N the number of jumps in [0, T]. Then N is a Poisson
random variable with distribution

$$P(N = k) = e^{-\lambda T} \frac{(\lambda T)^k}{k!}, \quad k = 0, 1, 2, \ldots.$$

We can sample from this distribution using the inverse transform method, as in
Figure 3.9, and thus generate a stratified sample of values as in Example 4.3.2.

For each value of N in the stratified sample, we need to generate the 
arrival times of the jumps in [0,7] conditional on the number of jumps N. 
For this we use a standard property of the Poisson process: given N = k, the 
arrival times of the jumps have the joint distribution of the order statistics of 
k independent random variables uniformly distributed over [0, T]. Thus, we 
may start from Unif[0,1] random variables $U_1, \ldots, U_k$, multiply them by T
to make them uniform over [0, T], and then sort them in ascending order to 
obtain the arrival times. 

An alternative to sorting, detailed in Fox [127], samples directly from the
joint distribution of the order statistics. Let $V_1, \ldots, V_k$ and
$U_1, \ldots, U_k$ denote independent Unif[0,1] random variables. Then

$$V_1^{1/k} V_2^{1/(k-1)} \cdots V_k,\ \ \ldots,\ \ V_1^{1/k} V_2^{1/(k-1)},\ \ V_1^{1/k} \tag{4.51}$$

have the joint distribution of the ascending order statistics of
$U_1, \ldots, U_k$. For example,

$$P(\max(U_1, \ldots, U_k) \le x) = P(U_1 \le x) \cdots P(U_k \le x) = x^k, \quad x \in [0, 1],$$

and the last term in (4.51) simply samples from this distribution by applying
its inverse to $V_1$. An induction argument verifies correctness of the
remaining terms in (4.51). The products in (4.51) can be evaluated recursively
from right to left. (To reduce round-off error, Fox [127] recommends recursively
summing the logarithms of the $V_i^{1/(k-i+1)}$ and then exponentiating.) The
ith arrival time, $i = 1, \ldots, k$, can then be generated as

$$\tau_i = T \cdot V_1^{1/k} V_2^{1/(k-1)} \cdots V_{k-i+1}^{1/i}, \tag{4.52}$$

i.e., by rescaling from [0,1] to [0, T]. Because the terms in (4.51) are
generated from right to left, Fox [127] instead sets

$$\tau_i = T \cdot \left( 1 - V_1^{1/k} V_2^{1/(k-1)} \cdots V_i^{1/(k-i+1)} \right);$$

this has the same distribution as (4.52) and allows generation of the arrival
times in a single pass. (Subtracting the values in (4.51) from 1 maps the ith
largest value to the ith smallest.)
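The pieces of this example can be assembled into a short Python sketch: stratify the number of jumps by the inverse transform method, then generate the arrival times in a single pass by accumulating logarithms. The function names and the choice to raise an error when exp(−mean) underflows are assumptions of this illustration; the sequential search in `poisson_inverse` is only suitable for moderate means.

```python
import math
import random


def poisson_inverse(u, mean):
    """Smallest k with P(N <= k) >= u for N ~ Poisson(mean)."""
    p = math.exp(-mean)
    if p == 0.0:
        raise ValueError("mean too large for this simple sequential search")
    F, k = p, 0
    while u > F and p > 0.0:
        k += 1
        p *= mean / k
        F += p
    return k


def single_pass_arrival_times(k, T, rng):
    """k arrival times on [0, T] in ascending order, without sorting."""
    times, log_prod = [], 0.0
    for i in range(1, k + 1):
        V = 1.0 - rng.random()                    # in (0, 1], so log is finite
        log_prod += math.log(V) / (k - i + 1)     # sum logs to limit round-off
        times.append(T * (1.0 - math.exp(log_prod)))
    return times


def stratified_poisson_paths(rate, T, K, rng):
    """K sets of arrival times with the number of jumps in [0, T] stratified."""
    paths = []
    for i in range(1, K + 1):
        V = (i - 1 + rng.random()) / K
        n = poisson_inverse(V, rate * T)
        paths.append(single_pass_arrival_times(n, T, rng))
    return paths
```

Because the stratified uniforms increase with the stratum index, the jump counts of the returned paths are nondecreasing.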

This method of stratification extends, in principle, to inhomogeneous Pois- 
son processes. The number of arrivals in [0, T] continues to be a Poisson ran- 
dom variable in this case. Conditional on the number of arrivals, the times of 
the arrivals continue to be distributed as order statistics, but now of random 
variables with a density proportional to the arrival rate $\lambda(t)$,
$t \in [0, T]$, rather than a uniform density.

Terminal Stratification in a Binomial Lattice


A binomial lattice provides a discrete-time, discrete-space approximation to 
the evolution of a diffusion process. Each node in the lattice (see Figure 4.7) 
is associated with a level of the underlying asset (or rate) S; over a single time 
step, the movement of the asset is restricted to two successor nodes, usually 
corresponding to a move up and a move down. By varying the spacing of the 
nodes and the transition probabilities, it is possible to vary the conditional 
mean and variance of the change in the underlying asset over a single time 
step, and thus to approximate virtually any diffusion process.


Fig. 4.7. A four-step binomial lattice. Each node has an associated value S of the
underlying asset. Each node has two successor nodes, corresponding to a move up
and a move down.


Binomial lattices are widely used for numerical option pricing. A typical 
algorithm proceeds by backward induction: an option contract determines the 
payoffs at the terminal nodes (which correspond to the option expiration); the 
option value at any other node is determined by discounting the values at its 
two successor nodes. 

Consider, for example, the pricing of a put with strike K. Each terminal
node corresponds to some level S of the underlying asset at expiration and
thus to an option value $(K - S)^+$. A generic node in the lattice has an “up”
successor node and a “down” successor node; suppose the option values $V_u$
and $V_d$, respectively, at the two successor nodes have already been
calculated. If the probability of a move up is p, and if the discount factor
over a single step is 1/(1 + R), then the value at the current node is

$$V = \frac{1}{1 + R} \left( p V_u + (1 - p) V_d \right).$$

In pricing an American put, the backward induction rule is

$$V = \max\left( \frac{1}{1 + R} \left( p V_u + (1 - p) V_d \right),\ K - S \right).$$
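The backward induction rules can be sketched in a few lines of Python. The lattice below is an assumption made for the example: it is multiplicative, with up and down factors u and d, and it uses the probability p = (1 + R − d)/(u − d), whereas the text leaves the node values and p general. The function name `binomial_put` is hypothetical.

```python
def binomial_put(S0, strike, u, d, R, m, american=False):
    """Value a put on an m-step multiplicative lattice by backward induction."""
    p = (1.0 + R - d) / (u - d)       # assumed up-move probability
    disc = 1.0 / (1.0 + R)            # one-step discount factor
    # Terminal payoffs (K - S)^+, indexed by the number of up moves j.
    V = [max(strike - S0 * u**j * d**(m - j), 0.0) for j in range(m + 1)]
    for step in range(m - 1, -1, -1):
        for j in range(step + 1):
            value = disc * (p * V[j + 1] + (1.0 - p) * V[j])
            if american:
                S = S0 * u**j * d**(step - j)
                value = max(value, strike - S)   # compare with immediate exercise
            V[j] = value
    return V[0]
```

For a one-step lattice with S0 = 100, strike 150, u = 1.1, d = 0.9, and R = 0.05, the discounted continuation value is 45/1.05, so the American rule returns the exercise value 50.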

Binomial option pricing is ordinarily a deterministic calculation, but it can 
be combined with Monte Carlo. Some path-dependent options, for example, 
are more easily valued through simulation than through backward induction. 
In some cases, there are advantages to sampling paths through the binomial 
lattice rather than sampling paths of a diffusion. For example, in an interest 
rate lattice, it is possible to compute bond prices at every node. The avail- 
ability of these bond prices can be useful in pricing path-dependent options 
on, e.g., bonds or swaps through simulation. 

An ordinary simulation through a binomial lattice starts at the root node 
and generates moves up or down using the appropriate probabilities for each 
node. As a further illustration of stratified sampling, we show how to simulate 
paths through a binomial lattice with the terminal value stratified. 

Consider, first, the case of a binomial lattice for which the probability of 
an up move has the same value p at all nodes. In this case, the total number 
of up moves N through an m-step lattice has the binomial distribution

$$P(N = k) = \binom{m}{k} p^k (1 - p)^{m-k}, \quad k = 0, 1, \ldots, m.$$


Samples from this distribution can be generated using the inverse transform 
method for discrete distributions, as in Example 2.2.4, much as in the case 
of the Poisson distribution in Figure 3.9. As explained in Example 4.3.2, it is 
a simple matter to generate stratified samples from a distribution using the 
inverse transform method. Thus, we have a mechanism for stratifying the total 
number of up moves through the lattice. Since the terminal node is determined 
by the difference N − (m − N) = 2N − m between the number of up and down
moves, stratifying N is equivalent to stratifying the terminal node. 

The next step is to sample a path through the lattice conditional on the
terminal node or, equivalently, conditional on the number of up moves N. The
key observation for this procedure is that, given N, all paths through the
lattice with N up moves (hence m − N down moves) are equally likely. Generating
a path conditional on N is simply a matter of randomly distributing N “ups”
among m moves. At each step, the probability of a move up is the ratio of the
number of remaining up moves to the number of remaining steps. The following
algorithm implements this idea:


k ← N (total number of up moves to be made)
for i = 0, ..., m − 1
    if k = 0 move down
    if k ≥ m − i move up
    if 0 < k < m − i
        generate U ~ Unif[0,1]
        if (m − i)U < k
            k ← k − 1
            move up
        else move down


The variable k records the number of remaining up moves and m − i is the
number of remaining steps. The condition (m − i)U < k is satisfied with
probability k/(m − i). This is the ratio of the number of remaining up moves
to the number of remaining steps, and is thus the conditional probability of an 
up move on the next step. Repeating this algorithm for each of the stratified 
values of N produces a set of paths through the lattice with stratified terminal 
node. 
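A Python sketch of the constant-probability case follows. It combines a discrete inverse transform for the number of up moves with the conditional path generation described above; the function names are hypothetical, and in this version the counter k is decremented on every up move, including forced ones.

```python
import math
import random


def binomial_inverse(u, m, p):
    """Smallest k with P(N <= k) >= u for N ~ Binomial(m, p)."""
    F = 0.0
    for k in range(m + 1):
        F += math.comb(m, k) * p**k * (1.0 - p)**(m - k)
        if u <= F:
            return k
    return m                                  # guard against round-off in F


def path_given_up_moves(m, n_up, rng):
    """List of m moves (+1 up, -1 down) containing exactly n_up up moves."""
    k, moves = n_up, []
    for i in range(m):
        remaining = m - i
        if k == 0:
            up = False                        # no up moves left
        elif k >= remaining:
            up = True                         # every remaining move must be up
        else:
            up = remaining * rng.random() < k # probability k / (m - i)
        if up:
            k -= 1
        moves.append(1 if up else -1)
    return moves


def stratified_lattice_paths(m, p, K, rng):
    """K paths whose number of up moves is stratified across K strata."""
    return [path_given_up_moves(
                m, binomial_inverse((i + rng.random()) / K, m, p), rng)
            for i in range(K)]
```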

This method extends to lattices in which the probability of an up move 
varies from node to node, though this extension requires substantial addi- 
tional computing. The first step is to compute the distribution of the termi- 
nal node, which is no longer binomial. The probability of reaching a node can 
be calculated using the lattice itself: this probability is the “price” (without 
discounting) of a security that pays 1 in that node and 0 everywhere else. 
Through forward induction, the probabilities of all terminal nodes can be 
found in $O(m^2)$ operations. Once these are computed, it becomes possible to
use the discrete inverse transform method to generate stratified samples from 
the terminal distribution. 

The next step is to simulate paths through the lattice conditional on a
terminal node. For this, let p denote the unconditional probability of an up
move at the current node. Let $h_u$ denote the unconditional probability of
reaching the given terminal node from the up successor of the current node;
let $h_d$ denote the corresponding probability from the down successor. Then
the conditional probability of an up move at the current node (given the
terminal node) is $p h_u / (p h_u + (1 - p) h_d)$ and the conditional
probability of a down move is $(1 - p) h_d / (p h_u + (1 - p) h_d)$. Once the
$h_u$ and $h_d$ have been calculated at every node, it is therefore a simple
matter to simulate paths conditional on a given terminal node by applying these
conditional probabilities at each step. Implementing this requires calculation
of O(m) conditional probabilities at every node, corresponding to the O(m)
terminal nodes. With $O(m^2)$ nodes in the lattice, these can be calculated
with a total effort of $O(m^3)$ using backward induction.
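For node-dependent probabilities, the forward and backward recursions might look as follows in Python. The indexing convention is an assumption of this sketch: node (i, j) is reached after i steps with j up moves, and `p[i][j]` is the up-move probability there. The function names are hypothetical. Computing the reach probabilities for a single terminal node takes one backward pass over the lattice.

```python
import random


def terminal_probabilities(p):
    """Forward induction: probability of each terminal node (0..m up moves)."""
    q = [1.0]
    for i in range(len(p)):
        nxt = [0.0] * (i + 2)
        for j in range(i + 1):
            nxt[j + 1] += q[j] * p[i][j]
            nxt[j] += q[j] * (1.0 - p[i][j])
        q = nxt
    return q


def reach_probabilities(p, target):
    """Backward induction: h[i][j] = P(end at `target` | at node (i, j))."""
    m = len(p)
    h = [[0.0] * (i + 1) for i in range(m + 1)]
    h[m][target] = 1.0
    for i in range(m - 1, -1, -1):
        for j in range(i + 1):
            h[i][j] = p[i][j] * h[i + 1][j + 1] + (1.0 - p[i][j]) * h[i + 1][j]
    return h


def path_given_terminal(p, target, rng):
    """Sample moves (+1 up, -1 down) conditional on ending at `target`."""
    h = reach_probabilities(p, target)
    j, moves = 0, []
    for i in range(len(p)):
        up_w = p[i][j] * h[i + 1][j + 1]            # p * h_u
        down_w = (1.0 - p[i][j]) * h[i + 1][j]      # (1 - p) * h_d
        up = rng.random() * (up_w + down_w) < up_w
        if up:
            j += 1
        moves.append(1 if up else -1)
    return moves
```

The output of `terminal_probabilities` is what the discrete inverse transform method would use to draw stratified terminal nodes.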


4.3.3 Poststratification 


As should be evident from our discussion thus far, implementation of
stratified sampling requires knowledge of stratum probabilities and a mechanism
for conditional sampling from strata. Some of the examples discussed in
Section 4.3.2 suggest that conditional sampling may be difficult even when
computing stratum probabilities is not. Poststratification combines knowledge
of stratum probabilities with ordinary independent sampling to reduce variance,
at least asymptotically. It can therefore provide an attractive alternative to
genuine stratification when conditional sampling is costly.

As before, suppose our objective is to estimate E[Y]. We have a mechanism for
generating independent replications $(X_1, Y_1), \ldots, (X_n, Y_n)$ of the
pair (X, Y); moreover, we know the probabilities $p_i = P(X \in A_i)$ for strata
$A_1, \ldots, A_K$. As usual, we require that these be positive and sum to 1.
For $i = 1, \ldots, K$, let

$$N_i = \sum_{j=1}^{n} \mathbf{1}\{X_j \in A_i\}$$

denote the number of samples that fall in stratum i and note that this is now
a random variable. Let

$$S_i = \sum_{j=1}^{n} \mathbf{1}\{X_j \in A_i\}\, Y_j$$

denote the sum of those $Y_j$ for which $X_j$ falls in stratum i, for
$i = 1, \ldots, K$. The usual sample mean $\bar{Y} = (Y_1 + \cdots + Y_n)/n$ can
be written as

$$\bar{Y} = \frac{S_1 + \cdots + S_K}{n} = \sum_{i=1}^{K} \frac{N_i}{n} \cdot \frac{S_i}{N_i},$$

at least if all $N_i$ are nonzero. By the strong law of large numbers,
$N_i/n \to p_i$ and $S_i/N_i \to \mu_i$, with probability 1, where
$\mu_i = E[Y \mid X \in A_i]$ denotes the stratum mean, as in (4.37).
Poststratification replaces the random fraction $N_i/n$ with its expectation
$p_i$ to produce the estimator

$$\hat{Y} = \sum_{i=1}^{K} p_i \frac{S_i}{N_i}. \tag{4.53}$$


Whereas the sample mean $\bar{Y}$ assigns weight 1/n to every observation, the
poststratified estimator weights values falling in stratum i by the ratio
$p_i/N_i$. Thus, values from undersampled strata ($N_i < np_i$) get more weight
and values from oversampled strata ($N_i > np_i$) get less weight. To cover the
possibility that none of the n replications falls in the ith stratum, we
replace $S_i/N_i$ with zero in (4.53) if $N_i = 0$.
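The poststratified estimator is simple to compute from an ordinary independent sample. In the Python sketch below, `stratum_of` is a hypothetical user-supplied function mapping a value of X to a stratum index 0, ..., K − 1, and `probs` holds the known stratum probabilities; empty strata contribute zero, following the convention just described.

```python
def poststratified_mean(xs, ys, stratum_of, probs):
    """Poststratified estimate of E[Y] from independent pairs (x, y)."""
    K = len(probs)
    N = [0] * K                     # number of samples in each stratum
    S = [0.0] * K                   # sum of the y values in each stratum
    for x, y in zip(xs, ys):
        i = stratum_of(x)
        N[i] += 1
        S[i] += y
    # Weight each stratum average by its known probability; skip empty strata.
    return sum(p * S[i] / N[i] for i, p in enumerate(probs) if N[i] > 0)


# Two equiprobable strata [0, 0.5) and [0.5, 1) for X uniform on [0, 1).
estimate = poststratified_mean(
    [0.1, 0.2, 0.7], [1.0, 3.0, 10.0],
    lambda x: min(int(2 * x), 1), [0.5, 0.5])
```

Here the first stratum is oversampled (two of three points), so its values receive weight 0.5/2 each, and the estimate is 0.5 · 2 + 0.5 · 10 = 6 rather than the sample mean 14/3.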

It is immediate from the almost sure convergence of $S_i/N_i$ to $\mu_i$ that
the poststratified estimator $\hat{Y}$ is a consistent estimator of E[Y]. Less
clear are its merits relative to the ordinary sample mean or a genuinely
stratified estimator. We will see that, asymptotically as the sample size grows,
poststratification is as effective as stratified sampling in reducing variance.
To establish this result, we first consider properties of ratio estimators more
generally.

Ratio Estimators


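We digress briefly to derive a central limit theorem for ratio estimators. For
this discussion, let $(R_i, Q_i)$, $i = 1, 2, \ldots$, be independent and
identically distributed pairs of random variables with

$$E[R_i] = \mu_R, \quad E[Q_i] = \mu_Q, \quad \mathrm{Var}[R_i] = \sigma_R^2, \quad \mathrm{Var}[Q_i] = \sigma_Q^2, \quad \mathrm{Cov}[R_i, Q_i] = \sigma_{RQ},$$

and $\mu_Q \ne 0$. The sample means of the first n values are

$$\bar{R} = \frac{1}{n} \sum_{i=1}^{n} R_i, \quad \bar{Q} = \frac{1}{n} \sum_{i=1}^{n} Q_i.$$

By the strong law of large numbers, the ratio $\bar{R}/\bar{Q}$ converges with
probability 1 to $\mu_R/\mu_Q$.

By applying the delta method introduced in Section 4.1.4 to the function
h(x, y) = x/y, we obtain a central limit theorem of the form

$$\sqrt{n} \left( \frac{\bar{R}}{\bar{Q}} - \frac{\mu_R}{\mu_Q} \right) \Rightarrow N(0, \sigma^2)$$

for the ratio estimator. The variance parameter $\sigma^2$ is given by the
general expression in (4.26) for the delta method and simplifies in this case to

$$\sigma^2 = \frac{\mu_R^2}{\mu_Q^4}\, \sigma_Q^2 - \frac{2\mu_R}{\mu_Q^3}\, \sigma_{RQ} + \frac{\sigma_R^2}{\mu_Q^2} = \frac{\mathrm{Var}\left[ R - \frac{\mu_R}{\mu_Q} Q \right]}{\mu_Q^2}. \tag{4.54}$$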


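This parameter is consistently estimated by

$$s^2 = \frac{\sum_{i=1}^{n} \left( R_i - \bar{R} Q_i / \bar{Q} \right)^2}{n \bar{Q}^2},$$

from which we obtain an asymptotically valid $1 - \delta$ confidence interval

$$\frac{\bar{R}}{\bar{Q}} \pm z_{\delta/2} \frac{s}{\sqrt{n}},$$

with $z_{\delta/2} = -\Phi^{-1}(\delta/2)$.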
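The ratio estimator and its confidence interval can be computed directly from the formulas above. The following Python sketch is illustrative only; the function name `ratio_confidence_interval` and its default level are assumptions.

```python
import math
from statistics import NormalDist


def ratio_confidence_interval(R, Q, delta=0.05):
    """Ratio estimate Rbar/Qbar with an asymptotic 1 - delta interval."""
    n = len(R)
    Rbar = sum(R) / n
    Qbar = sum(Q) / n
    ratio = Rbar / Qbar
    # s^2 = sum_i (R_i - Rbar * Q_i / Qbar)^2 / (n * Qbar^2)
    s2 = sum((r - ratio * q) ** 2 for r, q in zip(R, Q)) / (n * Qbar ** 2)
    z = -NormalDist().inv_cdf(delta / 2.0)
    half_width = z * math.sqrt(s2 / n)
    return ratio, (ratio - half_width, ratio + half_width)
```

When every R_i is an exact multiple of Q_i, the residuals vanish and the interval collapses to the ratio itself.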
For fixed n, $\bar{R}/\bar{Q}$ is a biased estimator of $\mu_R/\mu_Q$. The bias
has the form

$$E\left[ \frac{\bar{R}}{\bar{Q}} - \frac{\mu_R}{\mu_Q} \right] = \frac{(\mu_R \sigma_Q^2 / \mu_Q^3) - (\sigma_{RQ} / \mu_Q^2)}{n} + O(1/n^2);$$

see, e.g., Fishman [121], p.109. Subtracting an estimate of the leading term
can reduce the bias to $O(1/n^2)$.

Poststratification: Asymptotic Variance


We now apply this analysis of ratio estimators to derive a central limit
theorem for the poststratified estimator $\hat{Y}$, which is a linear
combination of ratio estimators. A straightforward extension of the result for
a single ratio gives

$$\sqrt{n} \left( \frac{S_1}{N_1} - \mu_1, \ldots, \frac{S_K}{N_K} - \mu_K \right)^\top \Rightarrow N(0, \Sigma),$$

with the limiting matrix $\Sigma$ again determined by the delta method. For the
diagonal entries of $\Sigma$, (4.54) gives

$$\Sigma_{ii} = \frac{\mathrm{Var}[Y \mathbf{1}\{X \in A_i\} - \mu_i \mathbf{1}\{X \in A_i\}]}{p_i^2} = \frac{\sigma_i^2}{p_i},$$

with $\sigma_i^2$ the stratum variance defined in (4.38). A similar calculation
for $j \ne i$ gives

$$\Sigma_{ij} = \frac{\mathrm{Cov}[(Y - \mu_i) \mathbf{1}\{X \in A_i\},\ (Y - \mu_j) \mathbf{1}\{X \in A_j\}]}{p_i p_j} = 0,$$

because $A_i$ and $A_j$ are disjoint.

The poststratified estimator satisfies

$$\hat{Y} - \mu = \sum_{i=1}^{K} p_i \left( \frac{S_i}{N_i} - \mu_i \right)$$

and therefore

$$\sqrt{n}\, [\hat{Y} - \mu] \Rightarrow N(0, \sigma^2)$$

with

$$\sigma^2 = \sum_{i,j=1}^{K} p_i \Sigma_{ij} p_j = \sum_{i=1}^{K} p_i \sigma_i^2.$$

This is precisely the asymptotic variance for the stratified estimator based on
proportional allocation of samples to strata; see (4.42). It can be estimated
consistently by replacing each $\sigma_i^2$ with the sample variance of the
observations falling in the ith stratum.

From this result we see that in the large-sample limit, we can extract all 
the variance reduction of stratified sampling without having to sample condi- 
tionally from the strata by instead weighting each observation according to its 
stratum. How large the sample needs to be for the two methods to give similar 
results depends in part on the number of strata and their probabilities. There 
is no simple way to determine at what sample size this limit becomes rele- 
vant without experimentation. But stratified sampling is generally preferable 
and poststratification is best viewed as an alternative for settings in which 
conditional sampling from the strata is difficult.

Latin hypercube sampling is an extension of stratification for sampling in
multiple dimensions. Recall from the discussion in Example 4.3.4 that
stratified sampling in high dimensions is possible in principle but often
infeasible in practice. The difficulty is apparent even in the simple case of
sampling from the d-dimensional hypercube $[0,1)^d$. Partitioning each
coordinate into K strata produces $K^d$ strata for the hypercube, thus requiring
a sample size of at least $K^d$ to ensure that each stratum is sampled. For even moderately
large d, this may be prohibitive unless K is small, in which case stratification 
provides little benefit. For this reason, in Section 4.3.2 we focused on methods 
for stratifying a small number of important directions in multidimensional 
problems. 

Latin hypercube sampling treats all coordinates equally and avoids the ex- 
ponential growth in sample size resulting from full stratification by stratifying 
only the one-dimensional marginals of a multidimensional joint distribution. 
The method, introduced by McKay, Conover, and Beckman [259] and further 
analyzed in Stein [337], is most easily described in the case of sampling from 
the uniform distribution over the unit hypercube. Fix a dimension d and a 
sample size K. For each coordinate i = 1,...,d, independently generate a 
stratified sample $V_i^{(1)}, \ldots, V_i^{(K)}$ from the unit interval using 
K equiprobable strata; each $V_i^{(j)}$ is uniformly distributed over 
$[(j-1)/K, j/K)$. If we arrange the d stratified samples in columns, 

$$
\begin{array}{cccc}
V_1^{(1)} & V_2^{(1)} & \cdots & V_d^{(1)} \\
V_1^{(2)} & V_2^{(2)} & \cdots & V_d^{(2)} \\
\vdots & \vdots & & \vdots \\
V_1^{(K)} & V_2^{(K)} & \cdots & V_d^{(K)}
\end{array}
$$

then each row gives the coordinates of a point in $[0,1)^d$. The first row 
identifies a point in $[0,1/K)^d$, the second a point in $[1/K,2/K)^d$, and so 
on, corresponding to K points falling in subcubes along the diagonal of the 
unit hypercube. Now randomly permute the entries in each column of the array. 
More precisely, let $\pi_1, \ldots, \pi_d$ be permutations of {1,...,K}, drawn 
independently from the distribution that makes all K! such permutations 
equally likely. Let $\pi_j(i)$ denote the value to which i is mapped by the 
jth permutation. The rows of the array 

$$
\begin{array}{cccc}
V_1^{\pi_1(1)} & V_2^{\pi_2(1)} & \cdots & V_d^{\pi_d(1)} \\
V_1^{\pi_1(2)} & V_2^{\pi_2(2)} & \cdots & V_d^{\pi_d(2)} \\
\vdots & \vdots & & \vdots \\
V_1^{\pi_1(K)} & V_2^{\pi_2(K)} & \cdots & V_d^{\pi_d(K)}
\end{array}
\tag{4.55}
$$

continue to identify points in $[0,1)^d$, but they are no longer restricted to 
the diagonal. Indeed, each row is a point uniformly distributed over the unit 
hypercube. The K points determined by the K rows are not independent: if 
we project the K points onto their ith coordinates, the resulting set of values 
$\{V_i^{\pi_i(1)}, \ldots, V_i^{\pi_i(K)}\}$ is the same as the set 
$\{V_i^{(1)}, \ldots, V_i^{(K)}\}$ and thus forms 
a stratified sample from the unit interval. 

The “marginal” stratification property of Latin hypercube sampling is il- 
lustrated in Figure 4.8. The figure shows a sample of size K = 8 in dimension 
d = 2. Projecting the points onto either of their two coordinates shows that 
exactly one point falls in each of the eight bins into which each axis is par- 
titioned. Stratified sampling would require drawing a point from each square 
and thus a sample size of 64. 


Fig. 4.8. A Latin hypercube sample of size K = 8 in dimension d = 2. 


To generate a Latin hypercube sample of size K in dimension d, let $U_i^{(j)}$ 
be independent Unif[0,1) random variables for i = 1,...,d and j = 1,...,K. 
Let $\pi_1, \ldots, \pi_d$ be independent random permutations of {1,...,K} and 
set 

$$
V_i^{(j)} = \frac{\pi_i(j) - 1 + U_i^{(j)}}{K}, \quad i = 1,\ldots,d, \quad j = 1,\ldots,K. \tag{4.56}
$$

The sample consists of the K points $(V_1^{(j)}, \ldots, V_d^{(j)})$, 
j = 1,...,K. To generate a random permutation, first sample uniformly from 
{1,...,K}, then sample uniformly from the remaining values, and continue until 
only one value remains. In (4.56) we may choose one of the permutations 
($\pi_d$, say) to be the identity, $\pi_d(i) = i$, without affecting the joint 
distribution of the sample. 
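
The construction in (4.56) can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function names `random_permutation`, `latin_hypercube`, and `bins_hit` are hypothetical and chosen here for readability.

```python
import random


def random_permutation(K, rng):
    """Draw a uniform random permutation of {1, ..., K} by sequential sampling."""
    remaining = list(range(1, K + 1))
    perm = []
    while remaining:
        # Sample uniformly from the values that have not been used yet.
        perm.append(remaining.pop(rng.randrange(len(remaining))))
    return perm


def latin_hypercube(K, d, rng):
    """Return K points in [0,1)^d with coordinates (pi_i(j) - 1 + U) / K."""
    perms = [random_permutation(K, rng) for _ in range(d)]
    return [
        [(perms[i][j] - 1 + rng.random()) / K for i in range(d)]
        for j in range(K)
    ]


def bins_hit(points, K, i):
    """Sorted indices of the bins [b/K, (b+1)/K) occupied by coordinate i."""
    return sorted(int(p[i] * K) for p in points)


rng = random.Random(0)
sample = latin_hypercube(8, 2, rng)
# Projecting onto either axis should place exactly one point in each bin.
print(bins_hit(sample, 8, 0), bins_hit(sample, 8, 1))
```

Each call to `bins_hit` should list every bin index once, which is the marginal stratification property described above.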
Using the inverse transform method, this construction easily extends to 
nonuniform distributions. For example, to generate a Latin hypercube sample 
of size K from N(0, I) in $\mathbb{R}^d$, set 

$$
Z_i^{(j)} = \Phi^{-1}(V_i^{(j)}), \quad i = 1,\ldots,d, \quad j = 1,\ldots,K,
$$

with $\Phi$ the cumulative normal and $V_i^{(j)}$ as in (4.56). The sample 
consists of the vectors 

$$
Z^{(j)} = (Z_1^{(j)}, \ldots, Z_d^{(j)}), \quad j = 1,\ldots,K, \tag{4.57}
$$

in $\mathbb{R}^d$. Projecting these K points onto any axis produces a stratified sample of 
size K from the standard univariate normal distribution. Even if the inverse 
transform is inconvenient or if the marginals have different distributions, the 
construction in (4.55) continues to apply, provided we have some mechanism 
for stratifying the marginals: generate a stratified sample from each marginal 
using K equiprobable strata for each, then randomly permute these d stratified 
samples. 
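
As a rough sketch of the inverse transform step, the following Python code maps a uniform Latin hypercube sample to N(0, I) using `statistics.NormalDist` from the standard library. The helper names and the small clamp `TINY` (which only keeps the argument of the inverse normal strictly inside (0, 1)) are assumptions made for this illustration.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()
TINY = 2.0 ** -53  # guard so that inv_cdf never sees exactly 0 or 1


def latin_hypercube_normal(K, d, rng):
    """Return K points Z^(j) in R^d with Z_i^(j) = Phi^{-1}(V_i^(j))."""
    columns = []
    for _ in range(d):
        strata = list(range(K))
        rng.shuffle(strata)  # random permutation of the K strata
        column = []
        for j in range(K):
            v = (strata[j] + rng.random()) / K
            v = min(max(v, TINY), 1.0 - TINY)
            column.append(STD_NORMAL.inv_cdf(v))
        columns.append(column)
    # Row j collects the jth entry of every column.
    return [[columns[i][j] for i in range(d)] for j in range(K)]


def normal_strata_hit(points, K, i):
    """Sorted indices of the equiprobable N(0,1) strata hit by coordinate i."""
    return sorted(min(int(STD_NORMAL.cdf(z[i]) * K), K - 1) for z in points)


rng = random.Random(1)
Z = latin_hypercube_normal(10, 3, rng)
print([normal_strata_hit(Z, 10, i) for i in range(3)])
```

For marginals other than the normal, the same pattern applies with `inv_cdf` replaced by whatever mechanism stratifies that marginal.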

This construction does rely crucially on the assumption of independent 
marginals, and transforming variables to introduce dependence can affect the 
partial stratification properties of Latin hypercube samples in complicated 
ways. This is evident in the case of a multivariate normal distribution 
$N(0, \Sigma)$. To sample from $N(0, \Sigma)$, we set $X = AZ$ with 
$Z \sim N(0, I)$ and $AA^\top = \Sigma$. Replacing independently generated Zs 
with the Latin hypercube sample (4.57) produces points $X^{(j)} = AZ^{(j)}$, 
j = 1,...,K; but the marginals of the $X^{(j)}$ so constructed will not in 
general be stratified. Rather, the marginals of the $A^{-1}X^{(j)}$ are 
stratified. 


Example 4.4.1 Brownian paths. As a specific illustration, consider the 
simulation of Brownian paths at times $0 = t_0 < t_1 < \cdots < t_d$. As in 
(4.57), let $Z^{(1)}, \ldots, Z^{(K)}$ denote a Latin hypercube sample from 
N(0, I) in d dimensions. From these K points in $\mathbb{R}^d$, we generate K 
discrete Brownian paths $W^{(1)}, \ldots, W^{(K)}$ by setting 

$$
W^{(j)}(t_n) = \sum_{i=1}^{n} \sqrt{t_i - t_{i-1}}\, Z_i^{(j)}, \quad n = 1,\ldots,d.
$$

If we fix a time $t_n$, $n \ge 2$, and examine the K values 
$W^{(1)}(t_n), \ldots, W^{(K)}(t_n)$, these will not form a stratified sample 
from $N(0, t_n)$. It is rather the increments of the Brownian paths that would 
be stratified. 

These K Brownian paths could be used to generate K paths of a process 
driven by a single Brownian motion. It would not be appropriate to use 
the K Brownian paths to generate a single path of a process driven by a 
K-dimensional Brownian motion. Latin hypercube sampling introduces 
dependence between elements of the sample, whereas the coordinates of a 
K-dimensional (standard) Brownian motion are independent. Using 
$(W^{(1)}, \ldots, W^{(K)})$ in place of a K-dimensional Brownian motion would 
thus change the law of the simulated process and could introduce severe bias. 
In contrast, the marginal law of each $W^{(j)}$ coincides with that of a 
scalar Brownian motion. Put succinctly, in implementing a variance reduction 
technique we are free to introduce dependence across paths but not within 
paths. □ 
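
The point about increments can be checked with a short Python sketch. It builds K discrete Brownian paths from a Latin hypercube sample of normals and records which equiprobable strata the increments and the terminal values fall in. The helper names, the time grid, and the seed are assumptions made for this illustration.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()
TINY = 2.0 ** -53  # keeps inv_cdf arguments strictly inside (0, 1)


def lhs_normal(K, d, rng):
    """K Latin hypercube points from N(0, I) in R^d via the inverse transform."""
    columns = []
    for _ in range(d):
        strata = list(range(K))
        rng.shuffle(strata)
        columns.append([
            STD_NORMAL.inv_cdf(min(max((strata[j] + rng.random()) / K, TINY), 1.0 - TINY))
            for j in range(K)
        ])
    return [[columns[i][j] for i in range(d)] for j in range(K)]


def brownian_paths(Z, times):
    """W^(j)(t_n) = sum over i <= n of sqrt(t_i - t_{i-1}) Z_i^(j), with t_0 = 0."""
    paths = []
    for z in Z:
        w, t_prev, path = 0.0, 0.0, []
        for z_i, t in zip(z, times):
            w += math.sqrt(t - t_prev) * z_i
            path.append(w)
            t_prev = t
        paths.append(path)
    return paths


def strata_hit(values, std, K):
    """Sorted equiprobable N(0, std^2) strata occupied by the given values."""
    return sorted(min(int(STD_NORMAL.cdf(v / std) * K), K - 1) for v in values)


rng = random.Random(7)
K, times = 8, [0.25, 0.5, 0.75, 1.0]
W = brownian_paths(lhs_normal(K, len(times), rng), times)
increments = [[W[j][n] - (W[j][n - 1] if n > 0 else 0.0) for j in range(K)]
              for n in range(len(times))]
increment_strata = [strata_hit(inc, 0.5, K) for inc in increments]  # sqrt(0.25) = 0.5
terminal_strata = strata_hit([W[j][-1] for j in range(K)], 1.0, K)
print(increment_strata)
print(terminal_strata)
```

Every list in `increment_strata` should contain each stratum once, whereas `terminal_strata` will typically show repeated and missing strata.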


Example 4.4.2 Paths through a lattice. To provide a rather different example, 
we apply Latin hypercube sampling to the problem of simulating paths 
through a binomial lattice. (See Figure 4.7 and the surrounding discussion for 
background.) Consider an m-step lattice with fixed probabilities p and 1 − p. 
The “marginals” in this example correspond to the m time steps, so the 
dimension d equals m, and the sample size K is the number of paths. We encode 
a move up as a 1 and a move down as a 0. For each i = 1,...,m we generate 
a stratified sample of 1s and 0s: we “generate” $\lfloor pK \rfloor$ 1s and 
$\lfloor K - pK \rfloor$ 0s, and if pK is not an integer, the Kth value in the 
sample is 1 with probability p and 0 with probability 1 − p. For example, with 
d = 4, K = 8, and p = 0.52, we might get 

1111 

1111 

1111 

1111 

0000 

0000 

0000 

0100 


the columns corresponding to the d = 4 stratified samples. Applying a random 
permutation to each column produces, e.g., 


0101 
0011 
1101 
1110 
0100 
1011 
0010 
1100 


Each row now encodes a path through the lattice. For example, the last row 
corresponds to two consecutive moves up followed by two consecutive moves 
down. Notice that, for each time step i, the fraction of paths on which the 
ith step is a move up is very nearly p. This is the property enforced by Latin 
hypercube sampling. 
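
The column-by-column construction of this example can be sketched as follows. This is only an illustration under the encoding used above (1 for up, 0 for down); the function names `stratified_bits` and `lattice_paths` are hypothetical.

```python
import math
import random


def stratified_bits(K, p, rng):
    """floor(pK) ones and floor(K - pK) zeros, plus one Bernoulli(p) draw if needed."""
    bits = [1] * math.floor(p * K) + [0] * math.floor(K - p * K)
    if len(bits) < K:
        # pK is not an integer: the Kth value is 1 with probability p.
        bits.append(1 if rng.random() < p else 0)
    return bits


def lattice_paths(K, m, p, rng):
    """Return K paths of m up/down moves with stratified moves at each time step."""
    columns = []
    for _ in range(m):
        column = stratified_bits(K, p, rng)
        rng.shuffle(column)  # independent random permutation of each column
        columns.append(column)
    # Row j is the jth path through the lattice.
    return [[columns[i][j] for i in range(m)] for j in range(K)]


rng = random.Random(0)
paths = lattice_paths(8, 4, 0.52, rng)
up_counts = [sum(path[i] for path in paths) for i in range(4)]
print(paths)
print(up_counts)
```

With K = 8 and p = 0.52, each entry of `up_counts` is either 4 or 5, so the fraction of up moves at every step is close to p.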

One could take this construction a step further to enforce the (nearly) 
correct fraction of up moves at each node rather than just at each time step. 
For simplicity, suppose $Kp^d$ is an integer. To the root node, assign Kp 1s 
and K(1 − p) 0s. To the node reached from the root by taking $\ell$ steps up 
and k steps down, $\ell + k < d$, assign $Kp^{\ell+1}(1-p)^k$ ones and 
$Kp^{\ell}(1-p)^{k+1}$ zeros. Randomly and independently permute the ones and 
zeros at all nodes. The result encodes K paths through the lattice with the 
fraction of up moves out of every node exactly equal to p. 

Mintz [269] develops a simple way to implement essentially the same idea. 
His implementation eliminates the need to precompute and permute the 
outcomes at all nodes. Instead, it assigns to each node a counter that keeps 
track of the number of paths that have left that node by moving up. For 
example, consider again a node reached from the root by taking $\ell$ steps up 
and k steps down, $\ell + k < d$. Let K be the total number of paths to be 
generated and again suppose for simplicity that $Kp^d$ is an integer. The 
number of paths reaching the designated node is 

$$
K_{\ell k} = Kp^{\ell}(1-p)^{k},
$$

so the counter at that node counts from zero to $pK_{\ell k}$, the number of 
paths that exit that node by moving up. If a path reaches the node and finds 
the counter at i, it moves down with probability $i/(pK_{\ell k})$ and moves 
up with the complementary probability. If it moves up, the counter is 
incremented to i + 1. □ 


Variance Reduction and Variance Decomposition 


We now state some properties of Latin hypercube sampling that shed light 
on its effectiveness. These properties are most easily stated in the context of 
sampling from $[0,1)^d$. Thus, suppose our goal is to estimate 

$$
\alpha_f = \int_{[0,1)^d} f(u)\, du
$$

for some square-integrable $f : [0,1)^d \to \mathbb{R}$. The standard Monte 
Carlo estimator of this integral can be written as 

$$
\hat\alpha_f = \frac{1}{K} \sum_{j=0}^{K-1} f(U_{jd+1}, U_{jd+2}, \ldots, U_{jd+d})
$$

with $U_1, U_2, \ldots$ independent uniforms. The variance of this estimator is 
$\sigma^2/K$ with $\sigma^2 = \mathrm{Var}[f(U_1, \ldots, U_d)]$. For 
$V^{(1)}, \ldots, V^{(K)}$ as in (4.56), define the estimator 

$$
\hat\alpha_L = \frac{1}{K} \sum_{j=1}^{K} f(V^{(j)}).
$$

McKay et al. [259] show that 

$$
\mathrm{Var}[\hat\alpha_L] = \frac{\sigma^2}{K} + \frac{K-1}{K}\, \mathrm{Cov}[f(V^{(1)}), f(V^{(2)})],
$$

which could be larger or smaller than the variance of the standard estimator 
$\hat\alpha_f$, depending on the covariance between distinct points in the 
Latin hypercube sample. By construction, $V^{(j)}$ and $V^{(k)}$ avoid each 
other (for example, their ith coordinates cannot fall in the same bin if 
$j \ne k$), which suggests that the covariance will often be negative. This 
holds, in particular, if f is monotone in each coordinate, as shown by McKay 
et al. [259]. Proposition 3 of Owen [288] shows that for any 
(square-integrable) f and any $K \ge 2$, 

$$
\mathrm{Var}[\hat\alpha_L] \le \frac{\sigma^2}{K-1},
$$

so the variance produced by a Latin hypercube sample of size K is no larger 
than the variance produced by an i.i.d. sample of size K − 1. 

Stein [337] shows that as $K \to \infty$, Latin hypercube sampling eliminates 
the variance due to the additive part of f, in a sense we now explain. For each 
i = 1,...,d, let 

$$
f_i(u) = E[f(U_1, \ldots, U_{i-1}, u, U_{i+1}, \ldots, U_d)],
$$

for $u \in [0,1)$. Observe that each $f_i(U)$, $U \sim$ Unif[0,1), has 
expectation $\alpha_f$. The function 

$$
f_{add}(u_1, \ldots, u_d) = \sum_{i=1}^{d} f_i(u_i) - (d-1)\alpha_f
$$

also has expectation $\alpha_f$ and is the best additive approximation to f in 
the sense that 

$$
\int_{[0,1)^d} \left( f(u_1, \ldots, u_d) - f_{add}(u_1, \ldots, u_d) \right)^2 du_1 \cdots du_d
\le \int_{[0,1)^d} \left( f(u_1, \ldots, u_d) - \sum_{i=1}^{d} h_i(u_i) \right)^2 du_1 \cdots du_d
$$

for any univariate functions $h_1, \ldots, h_d$. Moreover, the residual 

$$
\epsilon = f(U_1, \ldots, U_d) - f_{add}(U_1, \ldots, U_d)
$$

is uncorrelated with $f_{add}(U_1, \ldots, U_d)$ and this allows us to 
decompose the variance $\sigma^2$ of $f(U_1, \ldots, U_d)$ as 
$\sigma^2 = \sigma_{add}^2 + \sigma_\epsilon^2$ with $\sigma_{add}^2$ the 
variance of $f_{add}(U_1, \ldots, U_d)$ and $\sigma_\epsilon^2$ the variance of 
the residual. Stein [337] showed that 

$$
\mathrm{Var}[\hat\alpha_L] = \frac{\sigma_\epsilon^2}{K} + o(1/K). \tag{4.58}
$$

Up to terms of order 1/K, Latin hypercube sampling eliminates 
$\sigma_{add}^2$, the variance due to the additive part of f, from the 
simulation variance. This further indicates that Latin hypercube sampling is 
most effective with integrands that nearly separate into a sum of 
one-dimensional functions. 
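
A small experiment can illustrate the role of the additive part. The sketch below compares the sample variance of the independent and Latin hypercube estimators for two test integrands chosen here as assumptions: a purely additive function (no residual variance) and a product of centered coordinates (no additive part). Sample sizes, seed, and function names are illustrative only.

```python
import random
from statistics import variance


def iid_sample(K, d, rng):
    """K independent uniform points in [0,1)^d."""
    return [[rng.random() for _ in range(d)] for _ in range(K)]


def lhs_sample(K, d, rng):
    """K Latin hypercube points in [0,1)^d."""
    columns = []
    for _ in range(d):
        strata = list(range(K))
        rng.shuffle(strata)
        columns.append([(strata[j] + rng.random()) / K for j in range(K)])
    return [[columns[i][j] for i in range(d)] for j in range(K)]


def estimator_variance(sampler, f, K, d, reps, rng):
    """Sample variance of the K-point average of f across independent replications."""
    estimates = [sum(f(u) for u in sampler(K, d, rng)) / K for _ in range(reps)]
    return variance(estimates)


def additive(u):
    # f coincides with its additive part, so the residual variance is zero.
    return sum(u)


def centered_product(u):
    # Each one-dimensional projection f_i is identically zero, so f is all residual.
    out = 1.0
    for x in u:
        out *= x - 0.5
    return out


rng = random.Random(2)
K, d, reps = 16, 4, 400
ratio_additive = (estimator_variance(iid_sample, additive, K, d, reps, rng)
                  / estimator_variance(lhs_sample, additive, K, d, reps, rng))
ratio_product = (estimator_variance(iid_sample, centered_product, K, d, reps, rng)
                 / estimator_variance(lhs_sample, centered_product, K, d, reps, rng))
print(ratio_additive, ratio_product)
```

The variance ratio for the additive integrand should be very large, while the ratio for the centered product should be close to 1, in line with (4.58).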


Output Analysis 


Under various additional conditions on f, Loh [238], Owen [285], and Stein 
[337] establish a central limit theorem for $\hat\alpha_L$ of the form 

$$
\sqrt{K}\, [\hat\alpha_L - \alpha_f] \Rightarrow N(0, \sigma_\epsilon^2),
$$

which in principle provides the basis for a large-sample confidence interval for 
$\alpha_f$ based on Latin hypercube sampling. In practice, $\sigma_\epsilon^2$ 
is neither known nor easily estimated, making this approach difficult to apply. 

A simpler approach to interval estimation generates i.i.d. estimators 
$\hat\alpha_L(1), \ldots, \hat\alpha_L(n)$, each based on a Latin hypercube 
sample of size K. An asymptotically (as $n \to \infty$) valid $1 - \delta$ 
confidence interval for $\alpha_f$ is provided by 

$$
\left( \frac{1}{n} \sum_{i=1}^{n} \hat\alpha_L(i) \right) \pm z_{\delta/2} \frac{s}{\sqrt{n}},
$$

with s the sample standard deviation of 
$\hat\alpha_L(1), \ldots, \hat\alpha_L(n)$ and $z_{\delta/2}$ the 
$1 - \delta/2$ quantile of the standard normal distribution. 

The only cost to this approach lies in foregoing the possibly greater vari- 
ance reduction from generating a single Latin hypercube sample of size nK 
rather than n independent samples of size K. Stein [337] states that this loss 
is small if K/d is large. 
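
The batching approach to interval estimation might be implemented along the following lines. This is a sketch under stated assumptions: the test integrand, the batch sizes, and the function names are hypothetical, and the normal quantile is taken from `statistics.NormalDist`.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def lhs_sample(K, d, rng):
    """K Latin hypercube points in [0,1)^d."""
    columns = []
    for _ in range(d):
        strata = list(range(K))
        rng.shuffle(strata)
        columns.append([(strata[j] + rng.random()) / K for j in range(K)])
    return [[columns[i][j] for i in range(d)] for j in range(K)]


def lhs_confidence_interval(f, K, d, n, delta, rng):
    """Interval from n independent Latin hypercube estimates, each of size K."""
    estimates = [mean([f(u) for u in lhs_sample(K, d, rng)]) for _ in range(n)]
    z = NormalDist().inv_cdf(1.0 - delta / 2.0)  # 1 - delta/2 normal quantile
    center = mean(estimates)
    half_width = z * stdev(estimates) / math.sqrt(n)
    return center - half_width, center + half_width


def integrand(u):
    # Illustrative integrand with known integral 1/2 + 1/3 over the unit square.
    return u[0] + u[1] ** 2


low, high = lhs_confidence_interval(integrand, 50, 2, 30, 0.05, random.Random(5))
print(low, high)
```

The interval is valid as n grows regardless of K, at the cost of the possibly greater variance reduction available from a single sample of size nK.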

A K x K array is called a Latin square if each of the symbols 1,..., K 
appears exactly once in each row and column. This helps explain the name 
“Latin hypercube sampling.” Latin squares are used in the design of exper- 
iments, along with the more general concept of an orthogonal array. Owen 
[286] extends Stein’s [337] approach to analyze the variance of Monte Carlo 
estimates based on randomized orthogonal arrays. This method generalizes 
Latin hypercube sampling by stratifying low-dimensional (but not just one- 
dimensional) marginal distributions. 


Numerical Illustration 


We conclude this section with a numerical example. We apply Latin hypercube 
sampling to the pricing of two types of path-dependent options — an Asian 
option and a barrier option. The Asian option is a call on the arithmetic 
average of the underlying asset over a finite set of dates; the barrier option 
is a down-and-out call with a discretely monitored barrier. The underlying 
asset is GBM($r$, $\sigma^2$) with r = 5%, $\sigma$ = 0.30, and an initial 
value of 50. The barrier is fixed at 40. The option maturity is one year in 
all cases. We report results for 8 and 32 equally spaced monitoring dates; the 
number of dates is the dimension of the problem. With d monitoring dates, we 
may view each discounted option payoff as a function of a standard normal 
random vector in $\mathbb{R}^d$ and apply Latin hypercube sampling to generate 
these vectors. 

Table 4.3 reports estimated variance reduction factors. Each entry in the 
table is an estimate of the ratio of the variance using independent sampling 
to the variance using a Latin hypercube sample of the same size. Thus, larger 
ratios indicate greater variance reduction. The sample sizes displayed are 50, 
200, and 800. The ratios are estimated based on 1000 replications of samples 
of the indicated sizes. 

The most salient feature of the results in Table 4.3 is the effect of varying 
the strike: in all cases, the variance ratio increases as the strike decreases. This 
is to be expected because at lower strikes the options are more nearly linear. 
The variance ratios are nearly the same in dimensions 8 and 32 and show little 
dependence on the sample size. We know that the variances of independent 
replications (the numerators in these ratios) are inversely proportional to the 
sample sizes. Because the ratios are roughly constant across sample sizes, we 
may conclude that the variance using Latin hypercube sampling is nearly 
inversely proportional to the sample size. This suggests that (at least in these 
examples) the asymptotic result in (4.58) is relevant for sample sizes as small 
as K = 50. 


                          8 steps                32 steps 
          Strike     50    200    800      50    200    800 
Asian       45      7.0    8.6    8.8     7.1    7.6    8.2 
Option      50      3.9    4.4    4.6     3.7    3.6    4.0 
            55      2.4    2.6    2.8     2.3    2.1    2.5 

Barrier     45      4.1    4.1    4.3     3.8    3.7    3.9 
Option      50      3.2    3.2    3.4     3.0    2.9    3.1 
            55      2.5    2.6    2.7     2.4    2.2    2.4 


Table 4.3. Variance reduction factors using Latin hypercube sampling for two 
path-dependent options. Results are displayed for dimensions (number of 
monitoring dates) 8 and 32 using samples of size 50, 200, and 800. Each entry 
in the table is estimated from 1000 replications, each replication consisting 
of 50, 200, or 800 paths. 


The improvements reported in Table 4.3 are mostly modest. Similar vari- 
ance ratios could be obtained by using the underlying asset as a control vari- 
ate; for the Asian option, far greater variance reduction could be obtained 
by using a geometric average control variate as described in Example 4.1.2. 
One potential advantage of Latin hypercube sampling is that it lends itself 
to the use of a single set of paths to price many different types of options. 
The marginal stratification feature of Latin hypercube sampling is beneficial 
in pricing many different options, whereas control variates are ideally tailored 
to a specific application. 


4.5 Matching Underlying Assets 


This section discusses a set of loosely related techniques with the common 
objective of ensuring that certain sample means produced in a simulation ex- 
actly coincide with their population values (i.e., with the values that would 
be attained in the limit of infinitely many replications). Although these tech- 
niques could be used in almost any application of Monte Carlo, they take on 
special significance in financial engineering where matching sample and 
population means will often translate to ensuring exact finite-sample pricing of 
underlying assets. The goal of derivatives pricing is to determine the value of a 
derivative security relative to its underlying assets. One could therefore argue 
that correct pricing of these underlying assets is a prerequisite for accurate 
valuation of derivatives. 

The methods we discuss are closely related to control variates, which 
should not be surprising since we noted (in Example 4.1.1) that underlying 
assets often provide convenient controls. There is also a link with stratified 
sampling: stratification with proportional allocation ensures that the sam- 
ple means of the stratum indicator functions coincide with their population 
means. We develop two types of methods: moment matching based on trans- 
formations of simulated paths, and methods that weight (but do not trans- 
form) paths in order to match moments. When compared with control variates 
or with each other, these methods may produce rather different small-sample 
properties while becoming equivalent as the number of samples grows. This 
makes it difficult to compare estimators on theoretical grounds. 


4.5.1 Moment Matching Through Path Adjustments 


The idea of transforming paths to match moments is most easily introduced 
in the setting of a single underlying asset S(t) simulated under a risk-neutral 
measure in a model with constant interest rate r. If the asset pays no 
dividends, we know that $E[S(t)] = e^{rt}S(0)$. Suppose we simulate n 
independent copies $S_1, \ldots, S_n$ of the process and define the sample 
mean process 

$$
\bar S(t) = \frac{1}{n} \sum_{i=1}^{n} S_i(t).
$$

For finite n, the sample mean will not in general coincide with E[S(t)]; the 
simulation could be said to misprice the underlying asset in the sense that 

$$
e^{-rt}\bar S(t) \ne S(0), \tag{4.59}
$$

the right side being the current price of the asset and the left side its 
simulation estimate. 

A possible remedy is to transform the simulated paths by setting 

$$
\tilde S_i(t) = S_i(t)\, \frac{E[S(t)]}{\bar S(t)}, \quad i = 1,\ldots,n, \tag{4.60}
$$

or 

$$
\tilde S_i(t) = S_i(t) + E[S(t)] - \bar S(t), \quad i = 1,\ldots,n, \tag{4.61}
$$

and then using the $\tilde S_i$ rather than the $S_i$ to price derivatives. 
Using either the multiplicative adjustment (4.60) or the additive adjustment 
(4.61) ensures that the sample mean of $\tilde S_1(t), \ldots, \tilde S_n(t)$ 
exactly equals E[S(t)]. 
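
The two adjustments are simple to apply at a single date. The following Python sketch assumes, purely for illustration, that the asset follows geometric Brownian motion with the parameter values shown; the function names are hypothetical.

```python
import math
import random


def terminal_prices(n, s0, r, sigma, t, rng):
    """n i.i.d. draws of S(t) under an assumed GBM(r, sigma^2) model."""
    return [s0 * math.exp((r - 0.5 * sigma ** 2) * t
                          + sigma * math.sqrt(t) * rng.gauss(0.0, 1.0))
            for _ in range(n)]


def multiplicative_adjustment(prices, target_mean):
    """(4.60): scale each price by E[S(t)] divided by the sample mean."""
    sample_mean = sum(prices) / len(prices)
    return [s * target_mean / sample_mean for s in prices]


def additive_adjustment(prices, target_mean):
    """(4.61): shift each price by E[S(t)] minus the sample mean."""
    sample_mean = sum(prices) / len(prices)
    return [s + target_mean - sample_mean for s in prices]


rng = random.Random(3)
s0, r, sigma, t, n = 50.0, 0.05, 0.30, 1.0, 64
prices = terminal_prices(n, s0, r, sigma, t, rng)
target = s0 * math.exp(r * t)  # E[S(t)] for a non-dividend-paying asset
sm = multiplicative_adjustment(prices, target)
sa = additive_adjustment(prices, target)
print(sum(prices) / n, sum(sm) / n, sum(sa) / n, target)
```

After either adjustment the sample mean agrees with `target` up to floating-point rounding, whereas the unadjusted sample mean generally does not.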

These and related transformations are proposed and tested in Barraquand 
[37], Boyle et al. [53], and Duan and Simonato [96]. Duan and Simonato call 
(4.60) empirical martingale simulation; Boyle et al. use the name moment 
matching. In other application domains, Hall [164] analyzes a related cen- 
tering technique for bootstrap simulation and Gentle [136] refers briefly to 
constrained sampling. In many settings, making numerical adjustments to 
samples seems unnatural — some discrepancy between the sample and popu- 
lation mean is to be expected, after all. In the financial context, the error in 
(4.59) could be viewed as exposing the user to arbitrage through mispricing 
and this might justify attempts to remove the error completely. 

A further consequence of matching the sample and population mean of 
the underlying asset is a finite-sample form of put-call parity. The algebraic 
identity 

$$
(a-b)^+ - (b-a)^+ = a - b
$$

implies the constraint 

$$
e^{-rT}E[(S(T)-K)^+] - e^{-rT}E[(K-S(T))^+] = S(0) - e^{-rT}K
$$

on the values of a call, a put, and the underlying asset. Any adjustment that 
equates the sample mean of $\tilde S_1(T), \ldots, \tilde S_n(T)$ to E[S(T)] 
ensures that 

$$
e^{-rT}\frac{1}{n}\sum_{i=1}^{n} (\tilde S_i(T)-K)^+ - e^{-rT}\frac{1}{n}\sum_{i=1}^{n} (K-\tilde S_i(T))^+ = S(0) - e^{-rT}K.
$$
This, too, may be viewed as a type of finite-sample no-arbitrage condition. 
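
The finite-sample parity relation can be verified numerically. The sketch below uses the multiplicative adjustment on terminal prices drawn from an assumed geometric Brownian motion; parameter values and names are illustrative.

```python
import math
import random


def discounted_call_put(prices, strike, r, T):
    """Discounted sample-average payoffs of a call and a put on the given prices."""
    n = len(prices)
    discount = math.exp(-r * T)
    call = discount * sum(max(s - strike, 0.0) for s in prices) / n
    put = discount * sum(max(strike - s, 0.0) for s in prices) / n
    return call, put


rng = random.Random(4)
s0, r, sigma, T, strike, n = 50.0, 0.05, 0.30, 1.0, 50.0, 100
raw = [s0 * math.exp((r - 0.5 * sigma ** 2) * T
                     + sigma * math.sqrt(T) * rng.gauss(0.0, 1.0))
       for _ in range(n)]

# Multiplicative adjustment: rescale so the sample mean equals exp(rT) * S(0).
scale = s0 * math.exp(r * T) / (sum(raw) / n)
adjusted = [s * scale for s in raw]

parity_value = s0 - math.exp(-r * T) * strike
call_raw, put_raw = discounted_call_put(raw, strike, r, T)
call_adj, put_adj = discounted_call_put(adjusted, strike, r, T)
parity_gap_raw = call_raw - put_raw - parity_value
parity_gap_adj = call_adj - put_adj - parity_value
print(parity_gap_raw, parity_gap_adj)
```

The gap for the adjusted paths is zero up to rounding, while the gap for the raw paths equals the discounted pricing error of the underlying asset.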

Of (4.60) and (4.61), the multiplicative adjustment (4.60) seems preferable 
on the grounds that it preserves positivity whereas the additive adjustment 
(4.61) can make some $\tilde S_i$ negative even if 
$S_1(t), \ldots, S_n(t)$, E[S(t)] are all positive. However, we get 
$E[\tilde S_i(t)] = E[S(t)]$ using (4.61) but not with (4.60). 
Indeed, (4.61) even preserves the martingale property in the sense that 

$$
E[e^{-r(T-t)}\tilde S_i(T) \mid \tilde S_i(u),\, 0 \le u \le t] = \tilde S_i(t).
$$

Both (4.60) and (4.61) change the law of the simulated process 
($\tilde S_i(t)$ and $S_i(t)$ will not in general have the same distribution) 
and thus typically introduce some bias in estimates computed from the adjusted 
paths. This bias vanishes as the sample size n increases and is typically 
O(1/n). 


Large-Sample Properties 


There is some similarity between the transformations in (4.60) and (4.61) and 
the nonlinear control variates discussed in Section 4.1.4. The current setting 
does not quite fit the formulation in Section 4.1.4 because the adjustments 
here affect individual observations and not just their means. 

To extend the analysis in Section 4.1.4, we formulate the problem as one 
of estimating $E[h_1(X)]$ with X taking values in $\mathbb{R}^d$ and $h_1$ 
mapping $\mathbb{R}^d$ into $\mathbb{R}$. For example, we might have 
X = S(T) and $h_1(x) = e^{-rT}(x-K)^+$ in the case of pricing a standard call 
option. The moment matching estimator has the form 

$$
\frac{1}{n} \sum_{i=1}^{n} h(X_i, \bar X)
$$

with $X_1, \ldots, X_n$ i.i.d. and $\bar X$ their sample mean. The function h 
is required to satisfy $h(x, \mu_X) = h_1(x)$ with $\mu_X = E[X]$. It is easy 
to see that an estimator of the form 

$$
e^{-rT}\frac{1}{n} \sum_{i=1}^{n} (\tilde S_i(T) - K)^+,
$$

with $\tilde S_i$ as in (4.60) or (4.61), fits in this framework. Notice, also, that by 
including in the vector X powers of other components of X, we make this 
formulation sufficiently general to include matching higher-order moments as 
well as the mean. 

Suppose now that $h(X_i, \cdot)$ is almost surely continuously differentiable 
in a neighborhood of $\mu_X$. Then 

$$
\frac{1}{n} \sum_{i=1}^{n} h(X_i, \bar X) \approx \frac{1}{n} \sum_{i=1}^{n} h_1(X_i) + \frac{1}{n} \sum_{i=1}^{n} \nabla_\mu h(X_i, \mu_X)\, [\bar X - \mu_X] \tag{4.62}
$$

with $\nabla_\mu h$ denoting the gradient of h with respect to its second 
argument. Because $\bar X \to \mu_X$, this approximation becomes increasingly 
accurate as n increases. This suggests that, asymptotically in n, the moment 
matching estimator is equivalent to a control variate estimator with control 
$\bar X$ and coefficient vector 

$$
\hat b_n^\top = -\frac{1}{n} \sum_{i=1}^{n} \nabla_\mu h(X_i, \mu_X) \to -E[\nabla_\mu h(X, \mu_X)]. \tag{4.63}
$$


Some specific results in this direction are established in Duan, Gauthier, and 
Simonato [97] and Hall [164]. However, even under conditions that make this 
argument rigorous, the moment matching estimator may perform either better 
or worse in small samples than the approximating control variate estimator. 

The dependence among the observations $h(X_i, \bar X)$, i = 1,...,n, introduced 
through use of a common sample mean $\bar X$ complicates output analysis. One 
approach proceeds as though the approximation in (4.62) held exactly and 
estimates a confidence interval the way one would with a linear control variate 
(cf. Sections 4.1.1 and 4.1.3). An alternative is to generate k independent 
batches, each of size m, and to apply moment matching separately to each 
batch of m paths. A confidence interval can then be formed from the sample 
mean and sample standard deviation of the k means computed from the k 
batches. As with stratified sampling or Latin hypercube sampling, the cost of 
batching lies in foregoing potentially greater variance reduction by applying 
the method to all km paths. 


Examples 


We turn now to some more specific examples of moment matching transfor- 
mations. 

Example 4.5.1 Brownian motion and geometric Brownian motion. In the 
case of a standard one-dimensional Brownian motion W, the additive 
transformation 

$$
\tilde W_i(t) = W_i(t) - \bar W(t)
$$

seems the most natural way to match the sample and population means: 
there is no reason to try to avoid negative values of $W_i$, and the mean of a 
normal distribution is a location parameter. The transformation 

$$
\tilde W_i(t) = \frac{W_i(t) - \bar W(t)}{s(t)/\sqrt{t}}, \tag{4.64}
$$

with s(t) the sample standard deviation of $W_1(t), \ldots, W_n(t)$, matches 
both first and second moments. But for this it seems preferable to scale the 
increments of the Brownian motion: with 

$$
W_i(t_k) = \sum_{j=1}^{k} \sqrt{t_j - t_{j-1}}\, Z_{ij}
$$

and $\{Z_{ij}\}$ independent N(0,1) random variables, set 

$$
\tilde Z_{ij} = \frac{Z_{ij} - \bar Z_j}{s_j},
$$

with $\bar Z_j$ and $s_j$ the sample mean and sample standard deviation of 
$Z_{1j}, \ldots, Z_{nj}$, and 

$$
\tilde W_i(t_k) = \sum_{j=1}^{k} \sqrt{t_j - t_{j-1}}\, \tilde Z_{ij}.
$$

This transformation preserves the independence of increments whereas (4.64) 
does not. 
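
Scaling the increments might be implemented as in the sketch below. It assumes that the normals driving each increment are standardized across paths by their own sample mean and sample standard deviation; the function name, time grid, and seed are illustrative choices.

```python
import math
import random
from statistics import mean, stdev


def standardized_increment_paths(n, times, rng):
    """Build n Brownian paths from normals standardized separately at each step."""
    m = len(times)
    Z = [[rng.gauss(0.0, 1.0) for _ in range(m)] for _ in range(n)]
    Z_tilde = [[0.0] * m for _ in range(n)]
    for j in range(m):
        column = [Z[i][j] for i in range(n)]
        z_bar, s = mean(column), stdev(column)
        for i in range(n):
            # Center and scale the normals used for the jth increment.
            Z_tilde[i][j] = (Z[i][j] - z_bar) / s
    paths = []
    for i in range(n):
        w, t_prev, path = 0.0, 0.0, []
        for j, t in enumerate(times):
            w += math.sqrt(t - t_prev) * Z_tilde[i][j]
            path.append(w)
            t_prev = t
        paths.append(path)
    return Z_tilde, paths


rng = random.Random(9)
times = [0.25, 0.5, 0.75, 1.0]
Z_tilde, paths = standardized_increment_paths(32, times, rng)
step_means = [mean([row[j] for row in Z_tilde]) for j in range(len(times))]
step_stdevs = [stdev([row[j] for row in Z_tilde]) for j in range(len(times))]
print(step_means, step_stdevs)
```

Each increment then has sample mean zero and the correct sample variance, and the sample mean of the paths is zero at every date.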

For geometric Brownian motion $S \sim$ GBM($r$, $\sigma^2$), the multiplicative 
transformation (4.60) is more natural. It reduces to 

$$
\tilde S_i(t) = S(0)\, e^{rt}\, \frac{n\, e^{\sigma W_i(t)}}{\sum_{j=1}^{n} e^{\sigma W_j(t)}}.
$$

This transformation does not lend itself as easily to matching higher moments. 

As a simple illustration, we apply these transformations to the pricing of 
a call option under Black-Scholes assumptions and compare results in 
Figure 4.9. An ordinary simulation generates replications of the terminal asset 
price using 

$$
S_i(T) = S(0) \exp\left( (r - \tfrac{1}{2}\sigma^2)T + \sigma\sqrt{T}\, Z_i \right), \quad i = 1,\ldots,n.
$$

Method Z1 replaces each $Z_i$ with $Z_i - \bar Z$ and method Z2 uses 
$(Z_i - \bar Z)/s$, with $\bar Z$ the sample mean and s the sample standard 
deviation of $Z_1, \ldots, Z_n$. Methods SM and SA use the multiplicative and 
additive adjustments (4.60) and (4.61). Method CV uses $\bar S$ as a control 
variate with the optimal coefficient estimated as in (4.5). 

Figure 4.9 compares estimates of the absolute bias and standard error for 
these methods in small samples of size n = 16, 64, 256, and 1024. The model 
parameters are S(0) = K = 50, r = 5%, o = 0.30, and the option expiration 
is T = 1. The results are based on 5000 replications of each sample size. The 
graphs in Figure 4.9 are on log-log scales; the slopes in the top panel are 
consistent with an O(1/n) bias for each method and those in the bottom panel 
are consistent with $O(1/\sqrt{n})$ standard errors. Also, the biases are about an 
order of magnitude smaller than the standard errors. 

In this example, the control variate estimator has the highest bias — recall 
that bias in this method results from estimation of the optimal coefficient — 
though the bias is quite small at n = 1024. Interestingly, the standard errors 
for the CV and SM methods are virtually indistinguishable. This suggests 
that the implicit coefficient (4.63) in the linear approximation to the multi- 
plicative adjustment coincides with the optimal coefficient. The CV and SM 
methods achieve somewhat smaller standard errors than SA and Z1, which 
may be considered suboptimal control variate estimators. The lowest variance 
is attained by Z2, but because this method adjusts both a sample mean and a 
sample variance, it should be compared with a control variate estimator using 
two controls. (Compared with ordinary simulation, Z2 reduces variance 
by a factor of about 40; the other methods reduce variance by factors ranging 
from about 3 to 7.) 

These results suggest that moment matching estimators are indeed closely 
related to control variate estimators. They can sometimes serve as an indirect 
way of implementing a control, potentially providing much of the variance 
reduction while reducing small-sample bias. Though we observe this in 
Figure 4.9, there is of course no guarantee that the same would hold in other 
examples. □ 

Fig. 4.9. Bias (top) and standard error (bottom) versus sample size in pricing a 
standard call option under Black-Scholes assumptions. The graphs compare moment 
matching based on the mean of $Z_i$ (Z1), the mean and standard deviation of 
$Z_i$ (Z2), multiplicative (SM) and additive (SA) adjustments based on 
$\bar S$, and an estimator using $\bar S$ as linear control variate (CV). 


Example 4.5.2 Short rate models. We consider, next, finite-sample 
adjustments to short rate processes with the objective of matching bond prices. 
We begin with an idealized setting of continuous-time simulation of a short 
rate process r(t) under the risk-neutral measure. From independent replications 
$r_1, \ldots, r_n$ of the process, suppose we can compute estimated bond prices 

$$
\hat B(0,T) = \frac{1}{n} \sum_{i=1}^{n} \exp\left( -\int_0^T r_i(t)\, dt \right). \tag{4.65}
$$

From these we define empirical forward rates 

$$
\hat f(0,T) = -\frac{\partial}{\partial T} \log \hat B(0,T). \tag{4.66}
$$

The model bond prices and forward rates are 

$$
B(0,T) = E\left[ \exp\left( -\int_0^T r(t)\, dt \right) \right]
$$

and 

$$
f(0,T) = -\frac{\partial}{\partial T} \log B(0,T).
$$

The adjustment 

$$
\tilde r_i(t) = r_i(t) + f(0,t) - \hat f(0,t)
$$

results in 

$$
\frac{1}{n} \sum_{i=1}^{n} \exp\left( -\int_0^T \tilde r_i(t)\, dt \right) = B(0,T);
$$

i.e., in exact pricing of bonds in finite samples. 

Suppose, now, that we can simulate r exactly but only at discrete dates 
$0 = t_0, t_1, \ldots, t_m$, and our bond price estimates from n replications 
are given by 

$$
\hat B(0,t_m) = \frac{1}{n} \sum_{i=1}^{n} \exp\left( -\sum_{j=0}^{m-1} r_i(t_j)[t_{j+1} - t_j] \right).
$$

The yield adjustment 

$$
\tilde r_i(t_j) = r_i(t_j) + \frac{\log \hat B(0,t_{j+1}) - \log \hat B(0,t_j)}{t_{j+1} - t_j} - \frac{\log B(0,t_{j+1}) - \log B(0,t_j)}{t_{j+1} - t_j}
$$

results in 

$$
\frac{1}{n} \sum_{i=1}^{n} \exp\left( -\sum_{j=0}^{m-1} \tilde r_i(t_j)[t_{j+1} - t_j] \right) = B(0,t_m).
$$

In this case, the adjustment corrects for discretization error as well as sampling 
variability. □ 
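
The discrete yield adjustment can be sketched in Python as follows. The short-rate paths here are generated by an arbitrary random walk and the model bond prices come from a flat 5% curve; both are hypothetical stand-ins, since any simulated rates and any model discount curve could be supplied. Function names are illustrative.

```python
import math
import random


def bond_price_estimates(rates, times):
    """Sample-average discount factors to each date, starting with 1 at t_0 = 0."""
    n = len(rates)
    estimates = [1.0]
    for m in range(1, len(times)):
        total = 0.0
        for path in rates:
            integral = sum(path[j] * (times[j + 1] - times[j]) for j in range(m))
            total += math.exp(-integral)
        estimates.append(total / n)
    return estimates


def yield_adjust(rates, times, model_prices):
    """Shift each simulated rate so that estimated bond prices match the model."""
    est = bond_price_estimates(rates, times)
    adjusted = []
    for path in rates:
        new_path = []
        for j in range(len(times) - 1):
            dt = times[j + 1] - times[j]
            shift = ((math.log(est[j + 1]) - math.log(est[j])) / dt
                     - (math.log(model_prices[j + 1]) - math.log(model_prices[j])) / dt)
            new_path.append(path[j] + shift)
        adjusted.append(new_path)
    return adjusted


rng = random.Random(11)
times = [0.0, 0.25, 0.5, 0.75, 1.0]
model_prices = [math.exp(-0.05 * t) for t in times]  # assumed flat 5% curve
rates = []
for _ in range(50):
    r, path = 0.05, []
    for _step in range(len(times) - 1):
        path.append(r)
        r += 0.01 * rng.gauss(0.0, 1.0)  # placeholder dynamics, not a fitted model
    rates.append(path)

adjusted = yield_adjust(rates, times, model_prices)
matched = bond_price_estimates(adjusted, times)
print(matched, model_prices)
```

After the adjustment, the estimated bond prices in `matched` agree with `model_prices` at every date up to rounding.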


Example 4.5.3 HJM framework. Consider a continuous-time HJM model of 
the evolution of the forward curve f(t,T), as in Section 3.6, and suppose we 
can simulate independent replications $f_1, \ldots, f_n$ of the paths of the 
curve. Because f(t,t) is the short rate at time t, the previous example 
suggests the adjustment 
$\tilde f_i(t,t) = f_i(t,t) + f(0,t) - \hat f(0,t)$ with $\hat f$ as in (4.66) 
and $\hat B(0,T)$ as in (4.65) but with $r_i(t)$ replaced by $f_i(t,t)$. But 
the HJM setting provides additional flexibility to match additional moments by 
adjusting other rates f(t,u). 

We know that for any 0 < t < T, 

$$
E\left[ \exp\left( -\int_0^t f(u,u)\, du - \int_t^T f(t,u)\, du \right) \right] = B(0,T);
$$

this follows from the fact that discounted bond prices are martingales and is 
nearly the defining property of the HJM framework. We choose $\tilde f$ to 
enforce the finite-sample analog of this property, namely 

$$
\frac{1}{n} \sum_{i=1}^{n} \exp\left( -\int_0^t \tilde f_i(u,u)\, du - \int_t^T \tilde f_i(t,u)\, du \right) = B(0,T).
$$

This is accomplished by setting 

$$
\tilde f_i(t,T) = f_i(t,T) + f(0,T) - \frac{\sum_{j=1}^{n} H_j(t,T)\, f_j(t,T)}{\sum_{j=1}^{n} H_j(t,T)}
$$

with 

$$
H_j(t,T) = \exp\left( -\int_0^t f_j(u,u)\, du - \int_t^T f_j(t,u)\, du \right).
$$

In practice, one would simulate on a discrete grid of times and maturities as 
explained in Section 3.6.2 and this necessitates some modification. However, 
using the discretization in Section 3.6.2, the discrete discounted bond prices 
are martingales and thus lend themselves to a similar adjustment. □ 


Example 4.5.4 Normal random vectors. For i.i.d. normal random vectors, 
centering by the sample mean is equivalent to sampling conditional on the 
sample mean equaling the population mean. To see this, let 
$X_1, \ldots, X_n$ be independent $N(\mu, \Sigma)$ random vectors. The 
adjusted vectors 

$$
\tilde X_i = X_i - \bar X + \mu
$$

have mean $\mu$; moreover, they are jointly normal with 

$$
\mathrm{Cov}\begin{bmatrix} \tilde X_1 \\ \tilde X_2 \\ \vdots \\ \tilde X_n \end{bmatrix}
= \begin{pmatrix}
(n-1)\Sigma/n & -\Sigma/n & \cdots & -\Sigma/n \\
-\Sigma/n & (n-1)\Sigma/n & & \vdots \\
\vdots & & \ddots & -\Sigma/n \\
-\Sigma/n & \cdots & -\Sigma/n & (n-1)\Sigma/n
\end{pmatrix},
$$

as can be verified using the Linear Transformation Property (2.23). But this 
is also the joint distribution of $X_1, \ldots, X_n$ given $\bar X = \mu$, as 
can be verified using the Conditioning Formula (2.25). □ 


4.5.2 Weighted Monte Carlo 


An alternative approach to matching underlying prices in finite samples as- 
signs weights to the paths instead of shifting them. The weights are chosen 
to equate weighted averages over the paths to the corresponding population 
means. 

Consider, again, the setting surrounding (4.59) with which we introduced 
the idea of moment matching. Suppose we want to price an option on the 
underlying asset S(t); we note that the simulation misprices the underlying 
asset in finite samples in the sense that the sample mean $\bar S(t)$ deviates from
$e^{rt} S(0)$. Rather than change the simulated values $S_1(t), \ldots, S_n(t)$, we may
choose weights $w_1, \ldots, w_n$ satisfying

$$\sum_{i=1}^n w_i S_i(t) = e^{rt} S(0),$$

and then use these same weights in estimating the expected payoff of an
option. For example, this yields the estimate

$$e^{-rt} \sum_{i=1}^n w_i (S_i(t) - K)^+$$


for the price of a call struck at K. 

The method can be formulated more generically as follows. Suppose we 
want to estimate $E[Y]$ and we know the mean vector $\mu_X = E[X]$ for some
random d-vector $X$. For example, $X$ might record the prices of underlying
assets at future dates, powers of those prices, or the discounted payoffs of
tractable options. Suppose we know how to simulate i.i.d. replications $(X_i, Y_i)$,
$i = 1, \ldots, n$, of the pair $(X, Y)$. In order to match the known mean $\mu_X$ in finite
samples, we choose weights $w_1, \ldots, w_n$ satisfying

$$\sum_{i=1}^n w_i X_i = \mu_X \tag{4.67}$$

and then use

$$\sum_{i=1}^n w_i Y_i \tag{4.68}$$

to estimate $E[Y]$. We may also want to require that the weights sum to 1:

$$\sum_{i=1}^n w_i = 1. \tag{4.69}$$

We can include this in (4.67) by taking one of the components of the $X_i$ to
be identically equal to 1.

The number of constraints d is typically much smaller than the number of 
replications n, so (4.67) does not determine the weights. We choose a particular
set of weights by selecting an objective function $H : \mathbb{R}^n \to \mathbb{R}$ and solving
the constrained optimization problem

$$\min H(w_1, \ldots, w_n) \quad \text{subject to (4.67).} \tag{4.70}$$

Because the replications are i.i.d., it is natural to restrict attention to functions
$H$ that are convex and symmetric in $w_1, \ldots, w_n$; the criterion in (4.70) then
penalizes deviations from uniformity in the weights. 

An approach of this type was proposed by Avellaneda et al. [26, 27], but 
more as a mechanism for model correction than variance reduction. In their 
setting, the constraint (4.67) uses a vector $\mu_o$ (to be interpreted as market
prices) different from $\mu_X$ (the model prices). The weights thus serve to calibrate
an imperfect model to the observed market prices of actively traded instruments
in order to more accurately price a less liquid instrument. Broadie,
Glasserman, and Ha [68] use a related technique for pricing American options;
we return to this in Chapter 8. It should be evident from the discussion
leading to (4.68) and (4.70) that there is a close connection between this approach
and using $X$ as a control variate; we make the connection explicit in
Example 4.5.6. First we treat the objective considered by Avellaneda et al.
[26, 27].


Example 4.5.5 Maximum entropy weights. A particularly interesting and in 
some respects convenient objective $H$ is the (negative) entropy function

$$H(w_1, \ldots, w_n) = \sum_{i=1}^n w_i \log w_i,$$

which we take to be $+\infty$ if any $w_i$ is negative. (By convention, $0 \cdot \log 0 = 0$.)
Using this objective will always produce positive weights, provided there is a
positive feasible solution to (4.67). Such a solution exists whenever the convex
hull of the points $X_1, \ldots, X_n$ contains $\mu_X$, and this almost surely occurs for
sufficiently large n.
We can solve for the optimal weights by first forming the Lagrangian

$$\sum_{i=1}^n w_i \log w_i - \nu \sum_{i=1}^n w_i - \lambda^\top \sum_{i=1}^n w_i X_i.$$

Here, $\nu$ is a scalar and $\lambda$ is a d-vector. For this objective it turns out to be
convenient to separate the constraint on the weight sum, as we have here.
Setting the derivative with respect to $w_i$ equal to zero and solving for $w_i$
yields

$$w_i = e^{\nu - 1} \exp(\lambda^\top X_i).$$

Constraining the weights to sum to unity yields

$$w_i = \frac{\exp(\lambda^\top X_i)}{\sum_{j=1}^n \exp(\lambda^\top X_j)}. \tag{4.71}$$

The vector $\lambda$ is then determined by the condition

$$\frac{\sum_{i=1}^n e^{\lambda^\top X_i} X_i}{\sum_{i=1}^n e^{\lambda^\top X_i}} = \mu_X,$$

which can be solved numerically.

Viewed as a probability distribution on $\{X_1, \ldots, X_n\}$, the $(w_1, \ldots, w_n)$
in (4.71) corresponds to an exponential change of measure applied to the
uniform distribution $(1/n, \ldots, 1/n)$, in a sense to be further developed in
Section 4.6. The solution in (4.71) may be viewed as the minimal adjustment
to the uniform distribution needed to satisfy the constraints (4.67)-(4.69). □
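As a rough illustration of how the condition following (4.71) might be solved
numerically, the sketch below applies Newton's method to the convex function
$\log \sum_i \exp(\lambda^\top (X_i - \mu_X))$, whose gradient vanishes exactly when the
weighted mean of the $X_i$ equals $\mu_X$. The helper name `max_entropy_weights`, the
choice of Newton's method, and the test case (a call on geometric Brownian
motion with S(0) = 100, r = 5%, sigma = 0.2, t = 1, K = 100) are assumptions made
for this example and are not taken from the text.

```python
import math
import numpy as np

def max_entropy_weights(X, mu, tol=1e-9, max_iter=50):
    """Return (w, lam) with w_i proportional to exp(lam' X_i) and sum_i w_i X_i = mu.

    Hypothetical helper: solves the condition below (4.71) by Newton's method.
    """
    X = np.asarray(X, dtype=float).reshape(len(X), -1)
    D = X - np.asarray(mu, dtype=float)      # centered points X_i - mu, shape (n, d)
    lam = np.zeros(D.shape[1])
    for _ in range(max_iter):
        a = D @ lam
        w = np.exp(a - a.max())              # shift the exponent for numerical stability
        w /= w.sum()                         # weights (4.71); centering does not change them
        grad = w @ D                         # sum_i w_i (X_i - mu), the constraint violation
        if np.linalg.norm(grad) < tol:
            return w, lam
        hess = (D * w[:, None]).T @ D - np.outer(grad, grad)   # weighted covariance matrix
        lam = lam - np.linalg.solve(hess, grad)
    raise RuntimeError("Newton iteration did not converge; mu may lie outside the convex hull")

# Illustrative use: match the known mean e^{rt} S(0) of the underlying asset.
rng = np.random.default_rng(0)
S0, r, sigma, t, K, n = 100.0, 0.05, 0.2, 1.0, 100.0, 2000
Z = rng.standard_normal(n)
S = S0 * np.exp((r - 0.5 * sigma**2) * t + sigma * math.sqrt(t) * Z)
payoff = math.exp(-r * t) * np.maximum(S - K, 0.0)

w, lam = max_entropy_weights(S, S0 * math.exp(r * t))
print("weighted mean of S(t):", w @ S)        # agrees with e^{rt} S(0) up to the solver tolerance
print("weighted call estimate:", w @ payoff)
print("unweighted call estimate:", payoff.mean())
```

Because the weights have the exponential form (4.71), they are automatically
positive and sum to one; only the d components of $\lambda$ need to be found.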
Example 4.5.6 Least-squares weights. The simplest objective to consider in
(4.70) is the quadratic

$$H(w_1, \ldots, w_n) = \tfrac{1}{2} w^\top w,$$

with $w$ the vector of weights $(w_1, \ldots, w_n)^\top$. We will show that the estimator
$w^\top Y = \sum_{i=1}^n w_i Y_i$ produced by these weights is identical to the control variate
estimator $\bar Y(\hat b_n)$ defined by (4.16).

Define an $n \times (d+1)$ matrix $A$ whose ith row is $(1, X_i^\top - \mu_X^\top)$. (Here
we assume that $X_1, \ldots, X_n$ do not contain an entry identically equal to 1.)
Constraints (4.67)-(4.69) can be expressed as $w^\top A = (1, 0)$, where 0 is a row
vector of d zeros. The Lagrangian becomes

$$\tfrac{1}{2} w^\top w + w^\top A \lambda$$

with $\lambda \in \mathbb{R}^{d+1}$. The first-order conditions are $w = -A\lambda$. From the constraint
we get

$$(1,0) = w^\top A = -\lambda^\top A^\top A \;\Rightarrow\; -\lambda^\top = (1,0)(A^\top A)^{-1} \;\Rightarrow\; w^\top = (1,0)(A^\top A)^{-1} A^\top,$$

assuming the matrix $A$ has full rank. The weighted Monte Carlo estimator of
the expectation of the $Y_i$ is thus

$$w^\top Y = (1,0)(A^\top A)^{-1} A^\top Y, \qquad Y = (Y_1, \ldots, Y_n)^\top. \tag{4.72}$$

The control variate estimator is the first entry of the vector $\beta \in \mathbb{R}^{d+1}$ that
solves

$$\min_\beta \tfrac{1}{2} (Y - A\beta)^\top (Y - A\beta);$$

i.e., it is the value fitted at $(1, \mu_X)$ in a regression of the $Y_i$ against the rows of
$A$. From the first-order conditions $(Y - A\beta)^\top A = 0$, we find that the optimal
$\beta$ is $(A^\top A)^{-1} A^\top Y$. The control variate estimator is therefore

$$(1,0)\beta = (1,0)(A^\top A)^{-1} A^\top Y,$$

which coincides with (4.72). When written out explicitly, the weights in (4.72)
take precisely the form displayed in (4.20), where we first noted the interpretation
of a control variate estimator as a weighted Monte Carlo estimator. □
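The equivalence established in Example 4.5.6 can be checked numerically. The
following sketch computes the least-squares weights from (4.72) and compares
the weighted estimator with the intercept of an ordinary least-squares
regression. The function names and the synthetic pair $(X, Y)$ are illustrative
assumptions; only the matrix $A$ and formula (4.72) come from the text.

```python
import numpy as np

def design_matrix(X, mu):
    """n x (d+1) matrix A whose ith row is (1, X_i' - mu')."""
    X = np.asarray(X, dtype=float).reshape(len(X), -1)
    return np.column_stack([np.ones(len(X)), X - np.asarray(mu, dtype=float)])

def least_squares_weights(X, mu):
    """Weights w' = (1, 0)(A'A)^{-1} A' appearing in (4.72)."""
    A = design_matrix(X, mu)
    e1 = np.zeros(A.shape[1])
    e1[0] = 1.0
    return A @ np.linalg.solve(A.T @ A, e1)

def control_variate_estimate(X, Y, mu):
    """First entry of the least-squares coefficient in a regression of Y on the rows of A."""
    beta, *_ = np.linalg.lstsq(design_matrix(X, mu), Y, rcond=None)
    return beta[0]

# Synthetic example (an assumption): X is bivariate standard normal, so mu_X = (0, 0).
rng = np.random.default_rng(1)
n = 500
X = rng.standard_normal((n, 2))
Y = np.exp(X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.standard_normal(n)
mu_X = np.zeros(2)

w = least_squares_weights(X, mu_X)
print("weighted Monte Carlo estimate:", w @ Y)
print("control variate estimate:     ", control_variate_estimate(X, Y, mu_X))
print("sum of weights:", w.sum(), " weighted mean of X:", w @ X)
```

Unlike the maximum entropy weights, these weights are not guaranteed to be
nonnegative; the sketch makes no attempt to enforce positivity.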


This link between the general strategy in (4.68) and (4.70) for constructing
"moment-matched" estimators and the more familiar method of control
variates suggests that (4.68) provides at best a small refinement of the control
variate estimator. As the sample size n increases, the refinement typically
vanishes and using knowledge of $\mu_X$ as a constraint in (4.67) becomes equivalent
to using it in a control variate estimator. A precise result to this effect
is proved in Glasserman and Yu [147]. This argues in favor of using control
variate estimators rather than (4.68), because they are easier to implement
and because more is known about their sampling properties.
4.6 Importance Sampling


4.6.1 Principles and First Examples 


Importance sampling attempts to reduce variance by changing the probability 
measure from which paths are generated. Changing measures is a standard 
tool in financial mathematics; we encountered it in our discussion of pricing 
principles in Section 1.2.2 and in several places in Chapter 3 in the guise of
changing numeraire. Appendix B.4 reviews some of the underlying mathematical
theory. When we switch from, say, the objective probability measure
to the risk-neutral measure, our goal is usually to obtain a more convenient
representation of an expected value. In importance sampling, we change measures
to try to give more weight to "important" outcomes, thereby increasing
sampling efficiency.
To make this idea concrete, consider the problem of estimating

$$\alpha = E[h(X)] = \int h(x) f(x)\,dx,$$

where $X$ is a random element of $\mathbb{R}^d$ with probability density $f$, and $h$ is a
function from $\mathbb{R}^d$ to $\mathbb{R}$. The ordinary Monte Carlo estimator is

$$\hat\alpha = \hat\alpha(n) = \frac{1}{n} \sum_{i=1}^n h(X_i)$$

with $X_1, \ldots, X_n$ independent draws from $f$. Let $g$ be any other probability
density on $\mathbb{R}^d$ satisfying

$$f(x) > 0 \Rightarrow g(x) > 0 \tag{4.73}$$

for all $x \in \mathbb{R}^d$. Then we can alternatively represent $\alpha$ as

$$\alpha = \int h(x) \frac{f(x)}{g(x)}\, g(x)\,dx.$$

This integral can be interpreted as an expectation with respect to the density
$g$; we may therefore write

$$\alpha = \tilde E\left[ h(X) \frac{f(X)}{g(X)} \right], \tag{4.74}$$

$\tilde E$ here indicating that the expectation is taken with $X$ distributed according to
$g$. If $X_1, \ldots, X_n$ are now independent draws from $g$, the importance sampling
estimator associated with $g$ is

$$\hat\alpha_g = \hat\alpha_g(n) = \frac{1}{n} \sum_{i=1}^n h(X_i) \frac{f(X_i)}{g(X_i)}. \tag{4.75}$$

The weight $f(X_i)/g(X_i)$ is the likelihood ratio or Radon-Nikodym derivative
evaluated at $X_i$.

It follows from (4.74) that $\tilde E[\hat\alpha_g] = \alpha$ and thus that $\hat\alpha_g$ is an unbiased
estimator of $\alpha$. To compare variances with and without importance sampling
it therefore suffices to compare second moments. With importance sampling,
we have

$$\tilde E\left[ \left( h(X) \frac{f(X)}{g(X)} \right)^2 \right] = E\left[ h(X)^2 \frac{f(X)}{g(X)} \right].$$

This could be larger or smaller than the second moment $E[h(X)^2]$ without
importance sampling; indeed, depending on the choice of $g$ it might even be
infinitely larger or smaller. Successful importance sampling lies in the art of
selecting an effective importance sampling density $g$.

Consider the special case in which h is nonnegative. The product h(x) f(x) 
is then also nonnegative and may be normalized to a probability density. 
Suppose g is this density. Then 


$$g(x) \propto h(x) f(x), \tag{4.76}$$

and $h(X_i) f(X_i)/g(X_i)$ equals the constant of proportionality in (4.76) regardless
of the value of $X_i$; thus, the importance sampling estimator $\hat\alpha_g$ in (4.75)
provides a zero-variance estimator in this case. Of course, this is useless in
practice: to normalize $h \cdot f$ we need to divide it by its integral, which is $\alpha$; the
zero-variance estimator is just $\alpha$ itself.

Nevertheless, this optimal choice of g does provide some useful guidance: in 
designing an effective importance sampling strategy, we should try to sample 
in proportion to the product of h and f. In option pricing applications, h 
is typically a discounted payoff and f is the risk-neutral density of a discrete 
path of underlying assets. In this case, the “importance” of a path is measured 
by the product of its discounted payoff and its probability density. 

If h is the indicator function of a set, then the optimal importance sampling 
density is the original density conditioned on the set. In more detail, suppose 
$h(x) = 1\{x \in A\}$ for some $A \subset \mathbb{R}^d$. Then $\alpha = P(X \in A)$ and the zero-variance
importance sampling density $h(x) f(x)/\alpha$ is precisely the conditional density
of $X$ given $X \in A$ (assuming $\alpha > 0$). Thus, in applying importance sampling
to estimate a probability, we should look for an importance sampling density
that approximates the conditional density. This means choosing $g$ to make
the event $\{X \in A\}$ more likely, especially if $A$ is a rare set under $f$.
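The estimator (4.75) is straightforward to code once $h$, $f$, $g$, and a sampler for
$g$ are available. The sketch below is a generic version, followed by a small
rare-event illustration: estimating $\alpha = P(X > a)$ for $X$ exponential with mean 1
by sampling from an exponential density with mean $a$. The function name, the
choice of target, and the choice of $g$ are assumptions made for this example;
the exact value $e^{-a}$ is used only for comparison.

```python
import math
import numpy as np

def importance_sampling_estimate(h, f_pdf, g_pdf, g_sample, n, rng):
    """Estimator (4.75): average of h(X_i) f(X_i) / g(X_i) with X_i drawn from g.

    Returns the estimate and its estimated standard error.
    """
    x = g_sample(rng, n)
    vals = h(x) * f_pdf(x) / g_pdf(x)
    return vals.mean(), vals.std(ddof=1) / math.sqrt(n)

# Illustrative target: alpha = P(X > a) with f(x) = exp(-x) on x >= 0.
a = 10.0
rate = 1.0 / a                                    # assumed choice: g is exponential with mean a
h = lambda x: (x > a).astype(float)
f_pdf = lambda x: np.exp(-x)
g_pdf = lambda x: rate * np.exp(-rate * x)
g_sample = lambda rng, n: rng.exponential(scale=1.0 / rate, size=n)
f_sample = lambda rng, n: rng.exponential(scale=1.0, size=n)

rng = np.random.default_rng(2)
n = 200_000
is_est, is_se = importance_sampling_estimate(h, f_pdf, g_pdf, g_sample, n, rng)
# Taking g = f makes every likelihood ratio equal to 1 and recovers ordinary Monte Carlo.
mc_est, mc_se = importance_sampling_estimate(h, f_pdf, f_pdf, f_sample, n, rng)

print("exact value:         ", math.exp(-a))
print("importance sampling: ", is_est, "+/-", is_se)
print("ordinary Monte Carlo:", mc_est, "+/-", mc_se)
```

The density $g$ used here satisfies (4.73) and puts far more mass on the event
$\{X > a\}$ than $f$ does, which is the qualitative guidance given by the
zero-variance density, though it is not the conditional density itself.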


Likelihood Ratios 


In our discussion thus far we have assumed, for simplicity, that $X$ is $\mathbb{R}^d$-valued,
but the ideas extend to $X$ taking values in more general sets. Also, we have
assumed that $X$ has a density $f$, but the same observations apply if $f$ is a
probability mass function (or, more generally, a density with respect to some
reference measure on $\mathbb{R}^d$, possibly different from Lebesgue measure).




For option pricing applications, it is natural to think of X as a discrete 
path of underlying assets. The density of a path (if one exists) is ordinarily not 
specified directly, but rather built from more primitive elements. Consider, for 
example, a discrete path S(t;), i = 0,1,...,m, of underlying assets or state 
variables, and suppose that this process is Markov. Suppose the conditional 
distribution of $S(t_i)$ given $S(t_{i-1}) = x$ has density $f_i(x, \cdot)$. Consider a change
of measure under which the transition densities $f_i$ are replaced with transition
densities $g_i$. The likelihood ratio for this change of measure is

$$\prod_{i=1}^m \frac{f_i(S(t_{i-1}), S(t_i))}{g_i(S(t_{i-1}), S(t_i))}.$$

More precisely, if $E$ denotes expectation under the original measure and $\tilde E$
denotes expectation under the new measure, then

$$E[h(S(t_1), \ldots, S(t_m))] = \tilde E\left[ h(S(t_1), \ldots, S(t_m)) \prod_{i=1}^m \frac{f_i(S(t_{i-1}), S(t_i))}{g_i(S(t_{i-1}), S(t_i))} \right], \tag{4.77}$$

for all functions $h$ for which the expectation on the left exists and is finite.
Here we have implicitly assumed that $S(t_0)$ is a constant. More generally,
we could allow it to have density $f_0$ under the original measure and density
$g_0$ under the new measure. This would result in an additional factor of
$f_0(S(t_0))/g_0(S(t_0))$ in the likelihood ratio.

We often simulate a path $S(t_0), \ldots, S(t_m)$ through a recursion of the form

$$S(t_{i+1}) = G(S(t_i), X_{i+1}), \tag{4.78}$$

driven by i.i.d. random vectors $X_1, X_2, \ldots, X_m$. Many of the examples considered
in Chapter 3 can be put in this form. The $X_i$ will often be normally
distributed, but for now let us simply assume they have common density $f$. If
we apply a change of measure that preserves the independence of the $X_i$ but
changes their common density to $g$, then the corresponding likelihood ratio is

$$\prod_{i=1}^m \frac{f(X_i)}{g(X_i)}.$$

This means that

$$E[h(S(t_1), \ldots, S(t_m))] = \tilde E\left[ h(S(t_1), \ldots, S(t_m)) \prod_{i=1}^m \frac{f(X_i)}{g(X_i)} \right], \tag{4.79}$$

where, again, $E$ and $\tilde E$ denote expectation under the original and new measures,
respectively, and the expectation on the left is assumed finite. Equation
(4.79) relies on the fact that $S(t_1), \ldots, S(t_m)$ are functions of $X_1, \ldots, X_m$.
Random Horizon


Identities (4.77) and (4.79) extend from a fixed number of steps m to a random
number of steps, provided the random horizon is a stopping time. We
demonstrate this in the case of i.i.d. inputs, as in (4.79). For each $n = 1, 2, \ldots$
let $h_n$ be a function of n arguments and suppose we want to estimate

$$E[h_N(S(t_1), \ldots, S(t_N))], \tag{4.80}$$

with $N$ a random variable taking values in $\{1, 2, \ldots\}$. For example, in the case
of a barrier option with barrier $b$, we might define $N$ to be the index of the
smallest $t_i$ for which $S(t_i) > b$, taking $N = m$ if all $S(t_0), \ldots, S(t_m)$ lie below
the barrier. We could then express the discounted payoff of an up-and-out put
as $h_N(S(t_1), \ldots, S(t_N))$ with

$$h_n(S(t_1), \ldots, S(t_n)) = 1\{n = m\}\, e^{-r t_m} (K - S(t_m))^+.$$

The option price then has the form (4.80).

Suppose that (4.78) holds and, as before, $E$ denotes expectation when the
$X_i$ are i.i.d. with density $f$ and $\tilde E$ denotes expectation when they are i.i.d. with
density $g$. For concreteness, suppose that $S(t_0)$ is fixed under both measures.
Let $N$ be a stopping time for the sequence $X_1, X_2, \ldots$; for example, $N$ could
be a stopping time for $S(t_1), S(t_2), \ldots$ as in the barrier option example. Then

$$E[h_N(S(t_1), \ldots, S(t_N)) 1\{N < \infty\}] = \tilde E\left[ h_N(S(t_1), \ldots, S(t_N)) \prod_{i=1}^N \frac{f(X_i)}{g(X_i)}\, 1\{N < \infty\} \right],$$

provided the expectation on the left is finite. This identity (sometimes called
Wald's identity or the fundamental identity of sequential analysis; see, e.g.,
Asmussen [20]) is established as follows:

$$\begin{aligned}
E[h_N(S(t_1), \ldots, S(t_N)) 1\{N < \infty\}]
&= \sum_{n=1}^\infty E[h_n(S(t_1), \ldots, S(t_n)) 1\{N = n\}] \\
&= \sum_{n=1}^\infty \tilde E\left[ h_n(S(t_1), \ldots, S(t_n)) 1\{N = n\} \prod_{i=1}^n \frac{f(X_i)}{g(X_i)} \right] \\
&= \tilde E\left[ h_N(S(t_1), \ldots, S(t_N)) \prod_{i=1}^N \frac{f(X_i)}{g(X_i)}\, 1\{N < \infty\} \right].
\end{aligned} \tag{4.81}$$

The second equality uses the stopping time property: because $N$ is a stopping
time the event $\{N = n\}$ is determined by $X_1, \ldots, X_n$ and this allows us to
apply (4.79) to each term in the infinite sum. It is entirely possible for the
event $\{N < \infty\}$ to have probability 1 under one of the measures but not the
other; we will see an example of this in Example 4.6.3.
Long Horizon


We continue to consider two probability measures under which random vectors
$X_1, X_2, \ldots$ are i.i.d., $P$ giving the $X_i$ density $f$, $\tilde P$ giving them density $g$.

It should be noted that even if $f$ and $g$ are mutually absolutely continuous,
the probability measures $P$ and $\tilde P$ will not be. Rather, absolute continuity
holds for the restrictions of these measures to events defined by a finite initial
segment of the infinite sequence. For $A \subset \mathbb{R}^d$, the event

$$\left\{ \lim_{m \to \infty} \frac{1}{m} \sum_{i=1}^m 1\{X_i \in A\} = \int_A f(x)\,dx \right\}$$

has probability 1 under $P$; but some such event must have probability 0 under
$\tilde P$ unless $f$ and $g$ are equal almost everywhere. In short, the strong law of large
numbers forces $P$ and $\tilde P$ to disagree about which events have probability 0.
This collapse of absolute continuity in the limit is reflected in the somewhat
pathological behavior of the likelihood ratio as the number of terms grows,
through an argument from Glynn and Iglehart [157]. Suppose that

$$\tilde E\left[ |\log(f(X_1)/g(X_1))| \right] < \infty;$$

then the strong law of large numbers implies that

$$\frac{1}{m} \sum_{i=1}^m \log(f(X_i)/g(X_i)) \to \tilde E[\log(f(X_1)/g(X_1))] \equiv c \tag{4.82}$$

with probability 1 under $\tilde P$. By Jensen's inequality,

$$c \le \log \tilde E[f(X_1)/g(X_1)] = \log \int \frac{f(x)}{g(x)}\, g(x)\,dx = 0,$$

with strict inequality unless $\tilde P(f(X_1) = g(X_1)) = 1$ because log is strictly
concave. But if $c < 0$, (4.82) implies

$$\sum_{i=1}^m \log(f(X_i)/g(X_i)) \to -\infty;$$

exponentiating, we find that

$$\prod_{i=1}^m \frac{f(X_i)}{g(X_i)} \to 0$$

with $\tilde P$-probability 1. Thus, the likelihood ratio converges to 0 though its
expectation equals 1 for all m. This indicates that the likelihood ratio becomes 
highly skewed, taking increasingly large values with small but non-negligible 
probability. This in turn can result in a large increase in variance if the change 
of measure is not chosen carefully. 
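The collapse of the likelihood ratio is easy to observe in a small experiment.
The sketch below takes $f$ to be the standard normal density and $g$ the normal
density with mean $\mu$ and variance 1, for which $\log(f(x)/g(x)) = -\mu x + \mu^2/2$
and $c = -\mu^2/2$. The specific values of $\mu$, the horizon, and the number of paths
are assumptions chosen for illustration.

```python
import numpy as np

mu, m, n_paths = 0.5, 200, 10_000       # illustrative values, not from the text
rng = np.random.default_rng(3)

# Each row holds X_1, ..., X_m drawn i.i.d. from g = N(mu, 1); f is N(0, 1).
X = rng.normal(loc=mu, scale=1.0, size=(n_paths, m))
# Running sum of log(f(X_i)/g(X_i)) = -mu * X_i + mu^2 / 2 along each path.
log_lr = np.cumsum(-mu * X + 0.5 * mu**2, axis=1)

c = -0.5 * mu**2                         # limit in (4.82) for this pair of densities
print("average of (1/m) * log likelihood ratio:", log_lr[:, -1].mean() / m, " c =", c)
for k in (10, 50, 200):
    lr = np.exp(log_lr[:, k - 1])
    # The median shrinks toward 0 as k grows; the sample mean is an increasingly
    # unreliable estimate of the true expectation, which equals 1 for every k.
    print(k, "steps: median LR =", np.median(lr), " sample mean LR =", lr.mean())
```

The sample mean of the likelihood ratio at long horizons is dominated by a few
very large values, which is the skewness described above.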
Output Analysis


An importance sampling estimator does not introduce dependence between 
replications and is just an average of i.i.d. replications. We can therefore sup- 
plement an importance sampling estimator with a large-sample confidence 
interval in the usual way by calculating the sample standard deviation across 
replications and using it in (A.6). Because likelihood ratios are often highly 
skewed, the sample standard deviation will often underestimate the true stan- 
dard deviation, and a very large sample size may be required for confidence 
intervals based on the central limit theorem to provide reasonable coverage. 
These features should be kept in mind in comparing importance sampling 
estimators based on estimates of their standard errors. 


Examples 


Example 4.6.1 Normal distribution: change of mean. Let $f$ be the univariate
standard normal density and $g$ the univariate normal density with mean $\mu$ and
variance 1. Then simple algebra shows that

$$\frac{f(Z)}{g(Z)} = \exp\left( -\mu Z + \tfrac{1}{2}\mu^2 \right).$$

A bit more generally, if we let $g_i$ have mean $\mu_i$, then

$$\prod_{i=1}^m \frac{f(Z_i)}{g_i(Z_i)} = \exp\left( -\sum_{i=1}^m \mu_i Z_i + \frac{1}{2} \sum_{i=1}^m \mu_i^2 \right). \tag{4.83}$$

If we simulate Brownian motion on a grid $0 = t_0 < t_1 < \cdots < t_m$ by setting

$$W(t_n) = \sum_{i=1}^n \sqrt{t_i - t_{i-1}}\, Z_i,$$

then (4.83) is the likelihood ratio for a change of measure that adds mean
$\mu_i \sqrt{t_i - t_{i-1}}$ to the Brownian increment over $[t_{i-1}, t_i]$. □
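To see (4.83) at work, the following sketch simulates Brownian motion on an
equally spaced grid with shifted normal inputs and uses the likelihood ratio to
estimate $P(W(T) > b)$, a quantity with a known value that serves as a check.
The function name, the constant shift chosen so that $W(T)$ has mean $b$ under the
new measure, and the parameter values are assumptions for this example.

```python
import math
import numpy as np

def brownian_tail_is(b, T, m, mu, n_paths, rng):
    """Estimate P(W(T) > b) on an m-step grid, adding mean mu[i] * sqrt(h) to increment i."""
    h = T / m
    mu = np.broadcast_to(np.asarray(mu, dtype=float), (m,))
    Z = rng.standard_normal((n_paths, m)) + mu     # Z_i has mean mu_i under the new measure
    W_T = math.sqrt(h) * Z.sum(axis=1)             # W(t_m) built from the increments
    lr = np.exp(-Z @ mu + 0.5 * mu @ mu)           # likelihood ratio (4.83)
    vals = (W_T > b) * lr
    return vals.mean(), vals.std(ddof=1) / math.sqrt(n_paths)

b, T, m = 3.0, 1.0, 10
shift = b * math.sqrt(T / m) / T                   # assumed choice: gives W(T) mean b
rng = np.random.default_rng(4)
est, se = brownian_tail_is(b, T, m, shift, 100_000, rng)
exact = 0.5 * math.erfc(b / math.sqrt(2.0 * T))    # 1 - Phi(b / sqrt(T))
print("importance sampling estimate:", est, "+/-", se)
print("exact value:                 ", exact)
```

Passing `mu = 0.0` makes the likelihood ratio identically 1, so the same function
then performs ordinary Monte Carlo.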


Example 4.6.2 Exponential change of measure. The previous example is a 
special case of a more general class of convenient measure transformations. 


For a cumulative distribution function $F$ on $\mathbb{R}$, define

$$\psi(\theta) = \log \int_{-\infty}^{\infty} e^{\theta x}\, dF(x).$$

This is the cumulant generating function of $F$, the logarithm of the moment
generating function of $F$. Let $\Theta = \{\theta : \psi(\theta) < \infty\}$ and suppose that $\Theta$ is
nonempty. For each $\theta \in \Theta$, set

$$F_\theta(x) = \int_{-\infty}^{x} e^{\theta u - \psi(\theta)}\, dF(u);$$

each $F_\theta$ is a probability distribution, and $\{F_\theta, \theta \in \Theta\}$ form an exponential
family of distributions. The transformation from $F$ to $F_\theta$ is called exponential
tilting, exponential twisting, or simply an exponential change of measure. If
$F$ has a density $f$, then $F_\theta$ has density

$$f_\theta(x) = e^{\theta x - \psi(\theta)} f(x).$$

Suppose that $X_1, \ldots, X_n$ are initially i.i.d. with distribution $F = F_0$ and
that we apply a change of measure under which they become i.i.d. with distribution
$F_\theta$. The likelihood ratio for this transformation is

$$\prod_{i=1}^n \frac{dF_0(X_i)}{dF_\theta(X_i)} = \exp\left( -\theta \sum_{i=1}^n X_i + n\psi(\theta) \right). \tag{4.84}$$

The standard normal distribution has $\psi(\theta) = \theta^2/2$, from which we see that
this indeed generalizes Example 4.6.1. A key feature of exponential twisting is
that the likelihood ratio, which is in principle a function of all $X_1, \ldots, X_n$,
reduces to a function of the sum of the $X_i$. In statistical terminology, the
sum of the $X_i$ is a sufficient statistic for $\theta$.

The cumulant generating function $\psi$ records important information about
the distributions $F_\theta$. For example, $\psi'(\theta)$ is the mean of $F_\theta$. To see this, let $E_\theta$
denote expectation with respect to $F_\theta$ and note that $\psi(\theta) = \log E_0[\exp(\theta X)]$.
Differentiation yields

$$\psi'(\theta) = \frac{E_0[X e^{\theta X}]}{E_0[e^{\theta X}]} = E_0[X e^{\theta X - \psi(\theta)}] = E_\theta[X].$$

A similar calculation shows that $\psi''(\theta)$ is the variance of $F_\theta$. The function $\psi$
passes through the origin; Hölder's inequality shows that it is convex, so that
$\psi''(\theta)$ is indeed positive. For further theoretical background on exponential
families see, e.g., Barndorff-Nielsen [35].

We conclude with some examples of exponential families. The normal distributions
$N(\theta, \sigma^2)$ form an exponential family in $\theta$ for all $\sigma > 0$. The gamma
densities

$$\frac{1}{\Gamma(a)\theta^a}\, x^{a-1} e^{-x/\theta}, \quad x \ge 0,$$

form an exponential family in $\theta$ for each value of the shape parameter $a > 0$.
With $a = 1$, this is the family of exponential distributions with mean $\theta$. The
Poisson distributions

$$e^{-\lambda} \frac{\lambda^k}{k!}, \quad k = 0, 1, 2, \ldots,$$

form an exponential family in $\theta = \log \lambda$. The binomial distributions

$$\frac{n!}{k!(n-k)!}\, p^k (1-p)^{n-k}, \quad k = 0, 1, \ldots, n,$$

form an exponential family in $\theta = \log(p/(1-p))$. □
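A small experiment illustrates both (4.84) and the interpretation of $\psi'(\theta)$ as
the mean of $F_\theta$. The sketch takes $F$ to be the exponential distribution with
mean 1, for which $\psi(\theta) = -\log(1 - \theta)$ for $\theta < 1$ and $F_\theta$ is exponential with rate
$1 - \theta$, and estimates $P(X_1 + \cdots + X_n > y)$. The choice of $F$, the rule for picking
$\theta$ (so that the twisted mean of the sum equals $y$), and the function names are
assumptions made for this example.

```python
import math
import numpy as np

def psi(theta):
    """Cumulant generating function of Exp(1); finite for theta < 1."""
    return -math.log(1.0 - theta)

def tail_of_sum_is(n, y, n_reps, rng):
    """Estimate P(X_1 + ... + X_n > y) for i.i.d. Exp(1) variables by exponential twisting."""
    theta = 1.0 - n / y                       # solves n * psi'(theta) = y (assumes y > n)
    X = rng.exponential(scale=1.0 / (1.0 - theta), size=(n_reps, n))   # draws from F_theta
    S = X.sum(axis=1)
    lr = np.exp(-theta * S + n * psi(theta))  # likelihood ratio (4.84): depends on X only via S
    vals = (S > y) * lr
    return vals.mean(), vals.std(ddof=1) / math.sqrt(n_reps), X.mean()

n, y = 10, 30.0
rng = np.random.default_rng(5)
est, se, twisted_mean = tail_of_sum_is(n, y, 200_000, rng)
# The sum of n Exp(1) variables is Erlang, so the tail probability is available in closed form.
exact = math.exp(-y) * sum(y**k / math.factorial(k) for k in range(n))
print("importance sampling estimate:", est, "+/-", se)
print("exact value:                 ", exact)
print("sample mean under F_theta:", twisted_mean, " psi'(theta) =", y / n)
```

Only the sum $S$ enters the likelihood ratio, reflecting its role as a
sufficient statistic for $\theta$.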


Example 4.6.3 Ruin probabilities. A classic application of importance sam- 
pling arises in estimating ruin probabilities in the theory of insurance risk. 
Consider an insurance firm earning premiums at a constant rate $p$ per unit
of time and paying claims that arrive at the jumps of a Poisson process with
rate $\lambda$. Letting $N(t)$ denote the number of claims arriving in $[0, t]$ and $Y_i$ the
size of the ith claim, $i = 1, 2, \ldots$, the net payout of the firm over $[0, t]$ is given
by

$$\sum_{i=1}^{N(t)} Y_i - pt.$$

Suppose the firm has a reserve of $x$; then ruin occurs if the net payout ever
exceeds $x$. We assume the claims are i.i.d. and independent of the Poisson
process. We further assume that $\lambda E[Y_i] < p$, meaning that premiums flow in
at a faster rate than claims are paid out; this ensures that the probability of
eventual ruin is less than 1.

If ruin ever occurs, it must occur at the arrival of a claim. It therefore
suffices to consider the discrete-time process embedded at the jumps of the
Poisson process. Let $\xi_1, \xi_2, \ldots$ be the interarrival times of the Poisson process;
these are independent and exponentially distributed with mean $1/\lambda$. The net
payout between the (n − 1)th and nth claims (including the latter but not the
former) is $X_n = Y_n - p\xi_n$. The net payout up to the nth claim is given by the
random walk $S_n = X_1 + \cdots + X_n$. Ruin occurs at

$$\tau_x = \inf\{n \ge 0 : S_n > x\},$$

with the understanding that $\tau_x = \infty$ if $S_n$ never exceeds $x$. The probability
of eventual ruin is $P(\tau_x < \infty)$. Figure 4.10 illustrates the notation for this
example.
The particular form of the increments $X_n$ is not essential to the problem
so we generalize the setting. We assume that $X_1, X_2, \ldots$ are i.i.d. with
$0 < P(X_i > 0) < 1$ and $E[X_i] < 0$, but we drop the specific form $Y_n - p\xi_n$. We
add the assumption that the cumulant generating function $\psi_X$ of the $X_i$ (cf.
Example 4.6.2) is finite in a neighborhood of the origin. This holds in the
original model if the cumulant generating function $\psi_Y$ of the claim sizes $Y_i$ is
finite in a neighborhood of the origin.

For any point $\theta$ in the domain of $\psi_X$, consider the exponential change of
measure with parameter $\theta$ and let $E_\theta$ denote expectation under this measure.
Because $\tau_x$ is a stopping time, we may apply (4.81) to write the ruin probability
$P(\tau_x < \infty)$ as an $E_\theta$-expectation. Because we have applied an exponential
change of measure, the likelihood ratio simplifies as in (4.84); thus, the ruin
probability becomes

$$P(\tau_x < \infty) = E_\theta\left[ e^{-\theta S_{\tau_x} + \psi_X(\theta)\tau_x}\, 1\{\tau_x < \infty\} \right]. \tag{4.85}$$

Fig. 4.10. Illustration of claim sizes $Y_i$, interarrival times $\xi_i$, and the random walk
$S_n$. Ruin occurs at the arrival of the rth claim.

If $0 < \psi_X'(\theta) < \infty$ (which entails $\theta > 0$ because $\psi_X(0) = 0$ and $\psi_X'(0) =
E[X_n] < 0$), then the random walk has positive drift $E_\theta[X_n] = \psi_X'(\theta)$ under
the twisted measure, and this implies $P_\theta(\tau_x < \infty) = 1$. We may therefore omit
the indicator inside the expectation on the right. It also follows that we may
obtain an unbiased estimator of the ruin probability by simulating the random
walk under $P_\theta$ until $\tau_x$ and returning the estimator $\exp(-\theta S_{\tau_x} + \psi_X(\theta)\tau_x)$.
This would not be feasible under the original measure because of the positive
probability that $\tau_x = \infty$.

Among all $\theta$ for which $\psi_X'(\theta) > 0$, one is particularly effective for simulation
and indeed optimal in an asymptotic sense. Suppose there is a $\theta > 0$ at
which $\psi_X(\theta) > 0$. There must then be a $\theta_x > 0$ at which $\psi_X(\theta_x) = 0$; convexity
of $\psi_X$ implies uniqueness of $\theta_x$ and positivity of $\psi_X'(\theta_x)$, as is evident
from Figure 4.11. In the insurance risk model with $X_n = Y_n - p\xi_n$, $\theta_x$ is the
unique positive solution to

$$\psi_Y(\theta) + \log\left( \frac{\lambda}{\lambda + p\theta} \right) = 0,$$

where $\psi_Y$ is the cumulant generating function for the claim-size distribution.
With the parameter $\theta_x$, (4.85) becomes

$$P(\tau_x < \infty) = E_{\theta_x}\left[ e^{-\theta_x S_{\tau_x}} \right] = e^{-\theta_x x}\, E_{\theta_x}\left[ e^{-\theta_x (S_{\tau_x} - x)} \right].$$

Because the overshoot $S_{\tau_x} - x$ is nonnegative, this implies the simple bound
$P(\tau_x < \infty) \le e^{-\theta_x x}$ on the ruin probability. Under modest additional regularity
conditions (for example, if the $X_n$ have a density), $E_{\theta_x}[e^{-\theta_x (S_{\tau_x} - x)}]$
converges to a constant $c$ as $x \to \infty$, providing the classical approximation

$$P(\tau_x < \infty) \sim c\, e^{-\theta_x x},$$

meaning that the ratio of the two sides converges to 1 as $x \to \infty$. Further details
and some of the history of this approximation are discussed in Asmussen
[20] and in references given there.

Fig. 4.11. Graph of a cumulant generating function $\psi_X$. The curve passes through
the origin and has negative slope there because $\psi_X'(0) = E[X] < 0$. At the positive
root $\theta_x$, the slope is positive.

From the perspective of simulation, the significance of $\theta_x$ lies in the variance
reduction achieved by the associated importance sampling estimator. The
unbiased estimator $\exp(-\theta_x S_{\tau_x})$, sampled under $P_{\theta_x}$, has second moment

$$E_{\theta_x}\left[ e^{-2\theta_x S_{\tau_x}} \right] \le e^{-2\theta_x x}.$$

By Jensen's inequality, the second moment of any unbiased estimator must be
at least as large as the square of the ruin probability, and we have seen that this
probability is $O(e^{-\theta_x x})$. In this sense, the second moment of the importance
sampling estimator based on $\theta_x$ is asymptotically optimal as $x \to \infty$.

This strategy for developing effective and even asymptotically optimal
importance sampling estimators originated in Siegmund's [331] application
in sequential analysis. It has been substantially generalized, particularly for
queueing and reliability applications, as surveyed in Heidelberger [175]. □
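The estimator of Example 4.6.3 can be sketched for a case in which the answer
is known. We assume, purely for illustration, exponentially distributed claims
with rate $\beta$, so that $\psi_Y(\theta) = \log(\beta/(\beta - \theta))$ and the root of the displayed
equation is $\theta_x = \beta - \lambda/p$. Twisting $X_n = Y_n - p\xi_n$ by $\theta$ then makes the claims
exponential with rate $\beta - \theta$ and the interarrival times exponential with rate
$\lambda + p\theta$. For this model the ruin probability has the classical closed form
$(\lambda/(p\beta))\, e^{-\theta_x x}$, which the code uses only as a check. The function name and
parameter values are hypothetical.

```python
import math
import numpy as np

def ruin_probability_is(x, lam, p, beta, n_paths, rng):
    """Importance sampling estimate of P(tau_x < infinity) using the root theta_x.

    Assumes claims Y ~ Exp(rate beta) with lam / beta < p, so that
    theta_x = beta - lam / p solves psi_Y(theta) + log(lam / (lam + p * theta)) = 0.
    """
    theta = beta - lam / p
    s = np.zeros(n_paths)                     # current value of the random walk on each path
    active = np.ones(n_paths, dtype=bool)     # paths that have not yet exceeded x
    while active.any():
        k = int(active.sum())
        y = rng.exponential(scale=1.0 / (beta - theta), size=k)       # twisted claim sizes
        xi = rng.exponential(scale=1.0 / (lam + p * theta), size=k)   # twisted interarrival times
        s[active] += y - p * xi
        active &= s <= x                      # a path stops at the first n with S_n > x
    vals = np.exp(-theta * s)                 # estimator exp(-theta_x * S_tau) on each path
    return vals.mean(), vals.std(ddof=1) / math.sqrt(n_paths), vals.max()

lam, beta, p, x = 1.0, 1.0, 1.25, 30.0        # illustrative parameters with lam * E[Y] < p
rng = np.random.default_rng(6)
est, se, largest = ruin_probability_is(x, lam, p, beta, 20_000, rng)
theta_x = beta - lam / p
exact = lam / (p * beta) * math.exp(-theta_x * x)
print("importance sampling estimate:", est, "+/-", se)
print("closed form (exponential claims):", exact)
print("largest replication:", largest, " bound exp(-theta_x * x):", math.exp(-theta_x * x))
```

Every replication terminates because the twisted walk has positive drift, and
every replication lies below the bound $e^{-\theta_x x}$, consistent with the
second-moment bound in the example.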


Example 4.6.4 A knock-in option. As a further illustration of importance 
sampling through an exponential change of measure, we apply the method 
to a down-and-in barrier option. This example is from Boyle et al. [53]. The 
option is a digital knock-in option with payoff 


$$1\{S(T) > K\} \cdot 1\Big\{ \min_{1 \le k \le m} S(t_k) < H \Big\},$$

with $0 < t_1 < \cdots < t_m = T$, $S$ the underlying asset, $K$ the strike, and
$H$ the barrier. If $H$ is much smaller than $S(0)$, most paths of an ordinary
simulation will result in a payoff of zero; importance sampling can potentially
make knock-ins less rare.

Suppose the underlying asset is modeled through a process of the form

$$S(t_n) = S(0) \exp(L_n), \qquad L_n = \sum_{i=1}^n X_i,$$

with $X_1, X_2, \ldots$ i.i.d. and $L_0 = 0$. This includes geometric Brownian motion
but many other models as well; see Section 3.5. The option payoff is then

$$1\{L_m > c,\ \tau < m\},$$

where $c = \log(K/S(0))$, $\tau$ is the first time the random walk $L_n$ drops below
$-b$, and $-b = \log(H/S(0))$. If $b$ or $c$ is large, the probability of a payoff is
small. To increase the probability of a payoff, we need to drive $L_n$ down toward
$-b$ and then up toward $c$.

Suppose the $X_i$ have cumulant generating function $\psi$ and consider importance
sampling estimators of the following form: exponentially twist the
distribution of the $X_i$ by some $\theta_-$ (with drift $\psi'(\theta_-) < 0$) until the barrier
is crossed, then twist the remaining $X_{\tau+1}, \ldots, X_m$ by some $\theta_+$ (with drift
$\psi'(\theta_+) > 0$) to drive the process up toward the strike. On the event $\{\tau < m\}$,
the likelihood ratio for this change of measure is (using (4.81) and (4.84))

$$\begin{aligned}
&\exp\left( -\theta_- L_\tau + \psi(\theta_-)\tau \right) \cdot \exp\left( -\theta_+ [L_m - L_\tau] + \psi(\theta_+)[m - \tau] \right) \\
&\quad = \exp\left( (\theta_+ - \theta_-) L_\tau - \theta_+ L_m + (\psi(\theta_-) - \psi(\theta_+))\tau + m\psi(\theta_+) \right).
\end{aligned}$$

The importance sampling estimator is the product of this likelihood ratio and
the discounted payoff.

We now apply a heuristic argument to select the parameters $\theta_-, \theta_+$. We
expect most of the variability in the estimator to result from the barrier
crossing time $\tau$, because for large $b$ and $c$ we expect $L_\tau \approx -b$ and $L_m \approx c$
on the event $\{\tau < m, L_m > c\}$. (In other words, the undershoot below $-b$
and the overshoot above $c$ should be small.) If we choose $\theta_-, \theta_+$ to satisfy
$\psi(\theta_-) = \psi(\theta_+)$, the likelihood ratio simplifies to

$$\exp\left( (\theta_+ - \theta_-) L_\tau - \theta_+ L_m + m\psi(\theta_+) \right),$$

and we thus eliminate explicit dependence on $\tau$.

To complete the selection of the parameters $\theta_\pm$, we impose the condition
that traveling in a straight-line path from 0 to $-b$ at rate $|\psi'(\theta_-)|$ and then
from $-b$ to $c$ at rate $\psi'(\theta_+)$, the process should reach $c$ at time $m$; i.e.,

$$\frac{-b}{\psi'(\theta_-)} + \frac{c + b}{\psi'(\theta_+)} = m.$$

These conditions uniquely determine $\theta_\pm$, at least if the domain of $\psi$ is sufficiently
large. This is illustrated in Figure 4.12.

Fig. 4.12. Illustration of importance sampling strategy for a knock-in option. Twisting
parameters $\theta_\pm$ are chosen so that (a) $\psi(\theta_-) = \psi(\theta_+)$ and (b) straight-line path
with slopes $\psi'(\theta_-)$ and $\psi'(\theta_+)$ reaches $-b$ and then $c$ in $m$ steps.

In the case of geometric Brownian motion GBM($\mu$, $\sigma^2$) with equally spaced
time points $t_n = nh$, we have

$$X_n \sim N\left( (\mu - \tfrac{1}{2}\sigma^2)h,\ \sigma^2 h \right),$$

and the cumulant generating function is

$$\psi(\theta) = (\mu - \tfrac{1}{2}\sigma^2)h\theta + \tfrac{1}{2}\sigma^2 h\theta^2.$$

Because this function is quadratic in $\theta$, it is symmetric about its minimum and
the condition $\psi(\theta_-) = \psi(\theta_+)$ implies that $\psi'(\theta_-) = -\psi'(\theta_+)$. Thus, under our
proposed change of measure, the random walk moves at a constant speed of
$|\psi'(\theta_\pm)|$. To traverse the path down to the barrier and up to the strike in $m$
steps, we must have

$$|\psi'(\theta_\pm)| = \frac{2b + c}{m}.$$

We can now solve for the twisting parameters to get

$$\theta_\pm = \left( \frac{1}{2} - \frac{\mu}{\sigma^2} \right) \pm \frac{2b + c}{m\sigma^2 h}.$$

The term in parentheses on the right is the point at which the quadratic $\psi$ is
minimized. The twisting parameters $\theta_\pm$ are symmetric about this point.
Table 4.4 reports variance ratios based on this method. The underlying
asset $S(t)$ is GBM($r$, $\sigma^2$) with $r = 5\%$, $\sigma = 0.15$, and initial value $S(0) = 95$.
We consider an option paying 10,000 if not knocked out, hence having price
$10{,}000 \cdot e^{-rT} P(\tau < m, S(T) > K)$. As above, $m$ is the number of steps
and $T = t_m$ is the option maturity. The last column of the table gives the
estimated ratio of the variance per replication using ordinary Monte Carlo to
the variance using importance sampling. It is thus a measure of the speed-up
produced by importance sampling. The estimates in the table are based
on 100,000 replications for each case. The results suggest that the variance
ratio depends primarily on the rarity of the payoff, and not otherwise on the
maturity. The variance reduction can be dramatic for extremely rare payoffs.

An entirely different application of importance sampling to barrier options
is developed in Glasserman and Staum [146]. In that method, at each step
along a simulated path, the value of an underlying asset is sampled conditional
on not crossing a knock-out barrier so that all paths survive to maturity.
The one-step conditional distributions define the change of measure in this
approach. □


                        H      K      Price    Variance Ratio
T = 0.25, m = 50       94     96     3017.6          2
                       90     96      426.6         10
                       85     96        5.6        477
                       90    106       13.2        177
T = 1,    m = 50       90    106      664.8          6
                       85     96      452.0          9
T = 0.25, m = 100      85     96        6.6        405
                       90    106       15.8        180

Table 4.4. Variance reduction using importance sampling in pricing a knock-in
barrier option with barrier H and strike K.
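The two-stage twisting of Example 4.6.4 can be sketched as follows for the
GBM($r$, $\sigma^2$) setting of Table 4.4. Under twisting by $\theta$, each increment remains
normal with variance $\sigma^2 h$ and mean $\psi'(\theta)$, and each step contributes a factor
$\exp(-\theta X + \psi(\theta))$ to the likelihood ratio. The function name, the `twist` flag
(which sets both parameters to zero to recover ordinary Monte Carlo), the seeds,
and the sample size are assumptions of this sketch rather than details of the
original experiments.

```python
import math
import numpy as np

def knock_in_estimate(S0, K, H, r, sigma, T, m, n_paths, rng, twist=True, notional=10_000.0):
    """Estimate the digital knock-in price; returns (mean, variance per replication)."""
    h = T / m
    b, c = math.log(S0 / H), math.log(K / S0)
    drift = (r - 0.5 * sigma**2) * h             # mean of X_n under the original measure
    var = sigma**2 * h                           # variance of X_n
    psi = lambda th: drift * th + 0.5 * var * th**2
    if twist:
        center = 0.5 - r / sigma**2              # minimizer of the quadratic psi
        delta = (2.0 * b + c) / (m * var)
        th_minus, th_plus = center - delta, center + delta
    else:
        th_minus = th_plus = 0.0                 # no change of measure
    L = np.zeros(n_paths)
    log_lr = np.zeros(n_paths)
    crossed = np.zeros(n_paths, dtype=bool)
    for _ in range(m):
        theta = np.where(crossed, th_plus, th_minus)   # theta_- before the barrier, theta_+ after
        X = drift + theta * var + math.sqrt(var) * rng.standard_normal(n_paths)
        L += X
        log_lr += -theta * X + psi(theta)        # one-step log likelihood ratio, as in (4.84)
        crossed |= L < -b
    payoff = notional * math.exp(-r * T) * (crossed & (L > c))
    vals = payoff * np.exp(log_lr)
    return vals.mean(), vals.var(ddof=1)

# One case from Table 4.4: H = 90, K = 96, T = 0.25, m = 50.
args = dict(S0=95.0, K=96.0, H=90.0, r=0.05, sigma=0.15, T=0.25, m=50, n_paths=200_000)
is_price, is_var = knock_in_estimate(rng=np.random.default_rng(7), twist=True, **args)
mc_price, mc_var = knock_in_estimate(rng=np.random.default_rng(8), twist=False, **args)
print("importance sampling price:", is_price)
print("ordinary Monte Carlo price:", mc_price)
print("estimated variance ratio:", mc_var / is_var)
```

The printed values are themselves random estimates and can be compared with the
corresponding row of Table 4.4; exact agreement should not be expected.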


4.6.2 Path-Dependent Options 


We turn now to a more ambitious application of importance sampling with 
the aim of reducing variance in pricing path-dependent options. We consider 
models of underlying assets driven by Brownian motion (or simply by nor- 
mal random vectors after discretization) and change the drift of the Brownian 
motion to drive the underlying assets into “important” regions, with “impor- 
tance” determined by the payoff of the option. We identify a specific change 
of drift through an optimization problem. 

The method described in this section is from Glasserman, Heidelberger, 
and Shahabuddin (henceforth abbreviated GHS) [139], and that reference con- 
tains a more extensive theoretical development than we provide here. This 
method restricts itself to deterministic changes of drift over discrete time 
steps. It is theoretically possible to eliminate all variance through a stochas- 
tic change of drift in continuous time, essentially by taking the option being 
priced as the numeraire asset and applying the change of measure associated 
with this change of numeraire. This however requires knowing the price of the 
option in advance and is not literally feasible, though it potentially provides a 
basis for approximations. Related ideas are developed in Chapter 16 of Kloeden
and Platen [211], Newton [278, 279], and Schoenmakers and Heemink
[319].

We restrict ourselves to simulations on a discrete time grid
$0 = t_0 < t_1 < \cdots < t_m = T$. We assume the only source of randomness in the simulated
model is a d-dimensional Brownian motion. The increment of the Brownian
motion from $t_{i-1}$ to $t_i$ is simulated as $\sqrt{t_i - t_{i-1}}\, Z_i$, where $Z_1, Z_2, \ldots, Z_m$ are
independent d-dimensional standard normal random vectors. Denote by Z the
concatenation of the $Z_i$ into a single vector of length n = md. Each outcome
of Z determines a path of underlying assets or state variables, and each such
path determines the discounted payoff of an option. If we let G denote the
composition of these mappings, then G(Z) is the discounted payoff derived
from Z. Our task is to estimate E[G(Z)], the expectation taken with Z having
the n-dimensional standard normal distribution.

An example will help fix ideas. Consider a single underlying asset modeled
as geometric Brownian motion GBM($r, \sigma^2$) and simulated using

$$S(t_i) = S(t_{i-1}) \exp\left(\left[r - \tfrac{1}{2}\sigma^2\right](t_i - t_{i-1}) + \sigma\sqrt{t_i - t_{i-1}}\, Z_i\right), \quad i = 1, \ldots, m. \tag{4.86}$$

Consider an Asian call option on the arithmetic average $\bar{S}$ of the $S(t_i)$. We
may view the payoff of the option as a function of the $Z_i$ and thus write

$$G(Z) = G(Z_1, \ldots, Z_m) = e^{-rT}\,[\bar{S} - K]^+.$$

Pricing the option means evaluating E[G(Z)], the expectation taken with
$Z \sim N(0, I)$.


Change of Drift: Linearization 


Through importance sampling we can change the distribution of Z and still
obtain an unbiased estimator of E[G(Z)], provided we weight each outcome
by the appropriate likelihood ratio. We restrict ourselves to changes of
distribution that change the mean of Z from 0 to some other vector $\mu$. Let $P_\mu$
and $E_\mu$ denote probability and expectation when $Z \sim N(\mu, I)$. From the form
of the likelihood ratio given in Example 4.6.1 for normal random vectors, we
find that

$$E[G(Z)] = E_\mu\left[G(Z)\, e^{-\mu^\top Z + \frac{1}{2}\mu^\top\mu}\right]$$

for any $\mu \in \mathbb{R}^n$. We may thus simulate as follows:


for replications $i = 1, \ldots, N$
    generate $Z^{(i)} \sim N(\mu, I)$
    $Y^{(i)} \leftarrow G(Z^{(i)}) \exp\left(-\mu^\top Z^{(i)} + \tfrac{1}{2}\mu^\top\mu\right)$
return $(Y^{(1)} + \cdots + Y^{(N)})/N$.
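
The algorithm above can be sketched in Python as follows. This is a minimal illustration under our own assumptions, not the implementation used in GHS [139]: the function name `is_estimate`, the test payoff `tail_indicator`, the seed, and the parameter values are all hypothetical. The test payoff is chosen only because its expectation, a normal tail probability, is known in closed form.

```python
import math
import numpy as np

def is_estimate(G, mu, n_reps, rng):
    """Importance-sampling estimate of E[G(Z)], Z ~ N(0, I), sampling from N(mu, I).

    G maps an (n_reps, n) array of rows z to a length-n_reps array of payoffs.
    Returns (estimate, standard_error).
    """
    mu = np.asarray(mu, dtype=float)
    # Sample from N(mu, I) by shifting standard normals.
    Z = mu + rng.standard_normal((n_reps, mu.size))
    # Likelihood ratio of N(0, I) to N(mu, I): exp(-mu'Z + mu'mu / 2).
    weights = np.exp(-Z @ mu + 0.5 * mu @ mu)
    Y = G(Z) * weights
    return Y.mean(), Y.std(ddof=1) / math.sqrt(n_reps)

# Hypothetical test payoff: indicator that the first coordinate exceeds 3,
# so that E[G(Z)] = P(Z_1 > 3) is available in closed form.
def tail_indicator(Z):
    return (Z[:, 0] > 3.0).astype(float)

rng = np.random.default_rng(0)
mu = np.array([3.0, 0.0, 0.0, 0.0])
estimate, std_err = is_estimate(tail_indicator, mu, 100_000, rng)
exact = 0.5 * math.erfc(3.0 / math.sqrt(2.0))  # P(Z_1 > 3)
print(estimate, std_err, exact)
```

Shifting the mean to 3 in the first coordinate places roughly half of the samples in the region where the payoff is nonzero; under the original distribution only about one sample in 740 would land there.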


This estimator is unbiased for any choice of $\mu$; we would like to choose a $\mu$
that produces a low-variance estimator.

If G takes only nonnegative values (as is typical of discounted option
payoffs), we may write G(z) = exp(F(z)), with the convention that $F(z) = -\infty$
if G(z) = 0. Also, note that taking an expectation over Z under $P_\mu$ is
equivalent to replacing Z with $\mu + Z$ and taking the expectation under the original
measure. (In the algorithm above, this simply means that we can sample from
$N(\mu, I)$ by sampling from N(0, I) and adding $\mu$.) Thus,

$$\begin{aligned}
E[G(Z)] = E\left[e^{F(Z)}\right] &= E_\mu\left[e^{F(Z)}\, e^{-\mu^\top Z + \frac{1}{2}\mu^\top\mu}\right] \\
&= E\left[e^{F(\mu+Z)}\, e^{-\mu^\top(\mu+Z) + \frac{1}{2}\mu^\top\mu}\right] \\
&= E\left[e^{F(\mu+Z)}\, e^{-\mu^\top Z - \frac{1}{2}\mu^\top\mu}\right].
\end{aligned} \tag{4.87}$$

For any $\mu$, the expression inside the expectation in (4.87) is an unbiased
estimator with Z having distribution N(0, I). To motivate a particular choice
of $\mu$, we now expand F to first order to approximate the estimator as

$$e^{F(\mu+Z)}\, e^{-\mu^\top Z - \frac{1}{2}\mu^\top\mu} \approx e^{F(\mu) + \nabla F(\mu) Z}\, e^{-\mu^\top Z - \frac{1}{2}\mu^\top\mu}, \tag{4.88}$$

with $\nabla F(\mu)$ the gradient of F at $\mu$. If we can choose $\mu$ to satisfy the fixed-point
condition

$$\nabla F(\mu) = \mu^\top, \tag{4.89}$$

then the expression on the right side of (4.88) collapses to a constant with
no dependence on Z. Thus, applying importance sampling with $\mu$ satisfying
(4.89) would produce a zero-variance estimator if (4.88) held exactly, and it
should produce a low-variance estimator if (4.88) holds only approximately.
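
The zero-variance claim can be checked numerically in the one case where (4.88) holds exactly, namely a linear F. The sketch below is our own illustration: the slope `a`, the intercept `c`, the dimension, and the seed are hypothetical choices. With $F(z) = c + a^\top z$ the gradient is constant, so $\mu = a$ satisfies (4.89) and every replication of the estimator in (4.87) should equal $\exp(c + \frac{1}{2} a^\top a)$ up to rounding.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
a = rng.standard_normal(n)   # hypothetical slope of a linear F
c = 0.3                      # hypothetical intercept

def F(z):
    # Exactly linear log-payoff, F(z) = c + a'z, so grad F(z) = a' everywhere.
    return c + z @ a

mu = a.copy()                # solves the fixed-point condition grad F(mu) = mu'
Z = rng.standard_normal((10_000, n))
# Expression inside the last expectation in (4.87), on draws from N(0, I).
Y = np.exp(F(mu + Z) - Z @ mu - 0.5 * mu @ mu)
exact = np.exp(c + 0.5 * a @ a)   # E[exp(c + a'Z)] for Z ~ N(0, I)
print(Y.min(), Y.max(), exact)
```

For a payoff whose logarithm is only approximately linear near $\mu$, the replications are no longer constant, but their spread is governed by the higher-order terms that (4.88) ignores.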


Change of Drift: Normal Approximation and Optimal Path 


We now present an alternative argument leading to an essentially equivalent
choice of $\mu$. Recall from the discussion surrounding (4.76) that the optimal
importance sampling density is the normalized product of the integrand and
the original density. For the problem at hand, this means that the optimal
density is proportional to

$$e^{F(z) - \frac{1}{2} z^\top z},$$

because exp(F(z)) is the integrand and $\exp(-z^\top z/2)$ is proportional to the
standard normal density. Normalizing this function by its integral produces
a probability density but not, in general, a normal density. Because we have
restricted ourselves to changes of mean, we may try to select $\mu$ so that $N(\mu, I)$
approximates the optimal distribution. One way to do this is to choose $\mu$ to
be the mode of the optimal density; i.e., choose $\mu$ to solve

$$\max_z \; F(z) - \tfrac{1}{2} z^\top z. \tag{4.90}$$

The first-order condition for the optimum is $\nabla F(z) = z^\top$, which coincides
with (4.89). If, for example, the objective in (4.90) is strictly concave, and if
the first-order condition has a solution, this solution is the unique optimum.

We may interpret the solution $z_*$ to (4.90) as an optimal path. Each $z \in \mathbb{R}^n$
may be interpreted as a path because each determines a discrete Brownian
path and thus a path of underlying assets. The solution to (4.90) is the
most “important” path if we measure importance by the product of payoff
exp(F(z)) and probability density $\exp(-z^\top z/2)/(2\pi)^{n/2}$. In choosing $\mu = z_*$,
we are therefore choosing the new drift to push the process along the optimal
path.

GHS [139] give conditions under which this approach to importance
sampling has an asymptotic optimality property. This property is based on
introducing a parameter $\epsilon$ and analyzing the second moment of the estimator as $\epsilon$
approaches zero. From a practical perspective, a small $\epsilon$ should be interpreted
as a nearly linear F.


Asian Option 


We illustrate the selection and application of the optimal change of drift in
the case of the Asian call defined above, following the discussion in GHS
[139]. Solving (4.90) is equivalent to maximizing $G(z)\exp(-z^\top z/2)$ with G the
discounted payoff of the Asian option. The discount factor $e^{-rT}$ is a constant in
this example, so for the purpose of optimization we may ignore it and redefine
G(z) to be $[\bar{S} - K]^+$. Also, in maximizing it clearly suffices to consider points
z at which $\bar{S} > K$ and thus at which G is differentiable.

For the first-order conditions, we differentiate

$$\log G(z) - \tfrac{1}{2} z^\top z$$

to get

$$z_j = \frac{1}{G(z)} \frac{\partial G(z)}{\partial z_j}, \quad j = 1, \ldots, m.$$

Using (4.86), we find that

$$\frac{\partial \bar{S}}{\partial z_j} = \frac{1}{m} \sum_{i=j}^{m} \frac{\partial S(t_i)}{\partial z_j} = \frac{1}{m} \sum_{i=j}^{m} \sigma\sqrt{t_j - t_{j-1}}\, S(t_i).$$

The first-order conditions thus become

$$z_j = \frac{\sum_{i=j}^{m} \sigma\sqrt{t_j - t_{j-1}}\, S(t_i)}{m\, G(z)}, \quad j = 1, \ldots, m.$$

Now we specialize to the case of an equally spaced time grid with $t_i - t_{i-1} \equiv h$.
This yields

$$z_1 = \frac{\sigma\sqrt{h}\,(G(z) + K)}{G(z)}, \qquad z_{j+1} = z_j - \frac{\sigma\sqrt{h}\, S(t_j)}{m\, G(z)}, \quad j = 1, \ldots, m-1. \tag{4.91}$$


Given the value of G(z), (4.91) and (4.86) determine z. Indeed, if y = G(z),
we could apply (4.91) to calculate $z_1$ from y, then (4.86) to calculate $S(t_1)$,
then (4.91) to calculate $z_2$, and so on. Through this iteration, each value
of y determines a z(y) and path $S(t_i, y)$, $i = 1, \ldots, m$. Solving the first-order
conditions reduces to finding the y for which the payoff at $S(t_1, y), \ldots, S(t_m, y)$
is indeed y; that is, it reduces to finding the root of the equation

$$\frac{1}{m} \sum_{j=1}^{m} S(t_j, y) - K - y = 0.$$

GHS [139] report that numerical examples suggest that this equation has a
unique root. This root can be found very quickly through a one-dimensional
search. Once the root $y_*$ is found, the optimization problem is solved by
$z_* = z(y_*)$. To simulate, we then set $\mu = z_*$ and apply importance sampling
with mean $\mu$.
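
The one-dimensional search just described might be sketched as follows. This is an illustration under our own assumptions rather than the code of GHS [139]: the function names `path_from_y` and `optimal_drift`, the use of bisection, the bracket [0.5, 100], and the parameter values in the final call are all hypothetical, and the bracket must contain a sign change of the root equation for the search to start.

```python
import numpy as np

def path_from_y(y, S0, K, r, sigma, T, m):
    """Alternate between (4.91) and (4.86) for a trial payoff value y > 0.

    Returns (z, S) with z = z(y) and S[j-1] = S(t_j, y), j = 1, ..., m.
    """
    h = T / m
    vol = sigma * np.sqrt(h)
    z = np.empty(m)
    S = np.empty(m)
    z[0] = vol * (y + K) / y
    s_prev = S0
    for j in range(m):
        S[j] = s_prev * np.exp((r - 0.5 * sigma**2) * h + vol * z[j])
        if j + 1 < m:
            z[j + 1] = z[j] - vol * S[j] / (m * y)
        s_prev = S[j]
    return z, S

def optimal_drift(S0, K, r, sigma, T, m, y_lo=0.5, y_hi=100.0, tol=1e-12):
    """Bisection for a root y* of mean(S(t_j, y)) - K - y = 0; returns (z*, y*).

    The bracket [y_lo, y_hi] is an assumption and must contain a sign change.
    """
    def residual(y):
        _, S = path_from_y(y, S0, K, r, sigma, T, m)
        return S.mean() - K - y

    f_lo, f_hi = residual(y_lo), residual(y_hi)
    if f_lo * f_hi > 0:
        raise ValueError("bracket does not contain a sign change")
    while y_hi - y_lo > tol:
        y_mid = 0.5 * (y_lo + y_hi)
        f_mid = residual(y_mid)
        if f_lo * f_mid <= 0:
            y_hi = y_mid
        else:
            y_lo, f_lo = y_mid, f_mid
    y_star = 0.5 * (y_lo + y_hi)
    z_star, _ = path_from_y(y_star, S0, K, r, sigma, T, m)
    return z_star, y_star

# Hypothetical at-the-money example with 16 equally spaced steps.
mu, y_star = optimal_drift(S0=50.0, K=50.0, r=0.05, sigma=0.30, T=1.0, m=16)
print(y_star, mu)
```

By (4.91) the components of the resulting drift decrease from one step to the next, so the change of measure pushes the early increments hardest; this is natural for an average, since early increments affect every later value of the path.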


Combined Importance Sampling and Stratification 


In GHS [139], further (and in some cases enormous) variance reduction is
achieved by combining importance sampling with stratification of a linear
projection of Z. Recall from Section 4.3.2 that sampling Z so that $v^\top Z$ is
stratified for some $v \in \mathbb{R}^n$ is easy to implement. The change of mean does
not affect this. Indeed, we may sample from $N(\mu, I)$ by sampling from N(0, I)
and then adding $\mu$; we can apply stratified sampling to N(0, I) before adding
$\mu$.
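
A sketch of this sampling step, under our own assumptions, is given below. It draws one point per stratum, uses equiprobable strata for the projection, and normalizes v to unit length so that the projection is standard normal; the function name `stratified_shifted_normals` and the numerical values are hypothetical. The inverse normal distribution function is taken from Python's standard `statistics.NormalDist`.

```python
from statistics import NormalDist
import numpy as np

def stratified_shifted_normals(mu, v, n_strata, rng):
    """One draw per stratum of Z ~ N(mu, I) with v'(Z - mu) stratified.

    The strata are the n_strata equiprobable intervals of the N(0, 1)
    distribution of v'(Z - mu), with v normalized to unit length.
    Returns (Z, X), where X holds the stratified projections.
    """
    mu = np.asarray(mu, dtype=float)
    v = np.asarray(v, dtype=float)
    v = v / np.linalg.norm(v)
    # Stratified uniforms: the i-th lies in [i / K, (i + 1) / K).
    U = (np.arange(n_strata) + rng.random(n_strata)) / n_strata
    U = np.clip(U, 1e-15, 1.0 - 1e-15)  # keep inv_cdf away from 0 and 1
    inv = NormalDist().inv_cdf
    X = np.array([inv(u) for u in U])
    # Conditional sampling: replace the component of Z0 along v by X.
    Z0 = rng.standard_normal((n_strata, mu.size))
    Z0 = Z0 - np.outer(Z0 @ v, v) + np.outer(X, v)
    # The change of mean is applied after stratification.
    return mu + Z0, X

rng = np.random.default_rng(2)
mu = np.array([0.4, 0.3, 0.2, 0.1])  # hypothetical drift vector
v = mu                               # stratify along the drift direction
Z, X = stratified_shifted_normals(mu, v, 100, rng)
print(Z.shape)
```

Each row of `Z` would then be passed through the payoff and weighted by the likelihood ratio exactly as in the unstratified algorithm; with equiprobable strata and one draw per stratum, the plain average of the weighted payoffs is the stratified estimator.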

Two strategies for selecting the stratification direction v are considered
in GHS [139]. One simply sets $v = \mu$ on the grounds that $\mu$ is an important
path and thus a potentially important direction for stratification. The other
strategy expands (4.88) to get

$$e^{F(\mu+Z)}\, e^{-\mu^\top Z - \frac{1}{2}\mu^\top\mu} \approx e^{F(\mu) + \nabla F(\mu) Z + \frac{1}{2} Z^\top H(\mu) Z}\, e^{-\mu^\top Z - \frac{1}{2}\mu^\top\mu},$$

with $H(\mu)$ the Hessian matrix of F at $\mu$. Importance sampling with
$\mu^\top = \nabla F(\mu)$ eliminates the linear term in the exponent, and this suggests that the
stratification should be tailored to the quadratic term.

In Section 4.3.2, we noted that the optimal stratification direction for
estimating an expression of the form $E[\exp(\frac{1}{2} Z^\top A Z)]$ with A symmetric is an
eigenvector of A. The optimal eigenvector is determined by the eigenvalues
of A through the criterion in (4.50). This suggests that we should stratify
along the optimal eigenvector of the Hessian of F at $\mu$. This entails numerical
calculation of the Hessian and its eigenvectors.

Table 4.5 shows results from GHS [139]. The table shows variance ratios
(i.e., variance reduction factors) using importance sampling and two combina- 
tions of importance sampling with stratified sampling, using the two strategies 
just described for selecting a stratification direction. All results use S(0) = 50, 
r = 0.05, and T = 1 and are estimated from one million paths for each case. 
The results show that importance sampling by itself can produce noteworthy 
variance reduction (especially for out-of-the-money options) and that the com- 
bined impact with stratification can be astounding. The combination reduces 
variance by factors in the thousands. 


                              Importance      IS &            IS &
 n    $\sigma$    K    Price     Sampling    Strat. ($\mu$)   Strat. ($v_{j*}$)

16     0.10      45     6.05        11          1,097          1,246
                 50     1.92         7          4,559          5,710
                 55     0.20        21         15,520         17,026
16     0.30      45     7.15         8          1,011          1,664
                 50     4.17         9          1,304          1,899
                 55     2.21        12          1,746          2,296
64     0.10      45     6.00        11            967          1,022
                 50     1.85         7          4,637          5,665
                 55     0.17        23         16,051         17,841
64     0.30      45     7.02         8          1,016          1,694
                 50     4.02         9          1,319          1,971
                 55     2.08        12          1,767          2,402


Table 4.5. Estimated variance reduction ratios for Asian options using importance
sampling and combinations of importance sampling with stratified sampling,
stratifying along the optimal $\mu$ or the optimal eigenvector $v_{j*}$. Stratified results use 100
strata.


The results in Table 4.5 may seem to suggest that stratification has a
greater impact than importance sampling, and one may question the value of
importance sampling in this example. But the effectiveness of stratification is
indeed enhanced by the change in mean, which results in more paths producing
positive payoffs. The positive-part operator $[\cdot]^+$ applied to $\bar{S} - K$ diminishes
the effectiveness of stratified sampling, because it tends to produce many
strata with a constant (zero) payoff; stratifying a region of constant payoff is
useless. By shifting the mean of Z, we implicitly move more of the distribution
of the stratification variable $v^\top Z$ (and thus more strata) into the region where
the payoff varies. In this particular example, the region defined by $\bar{S} > K$ is
reasonably easy to characterize and could be incorporated into the selection
of strata; however, this is not the case in more complex examples.

A further notable feature of Table 4.5 is that the variance ratios are quite
similar whether we stratify along $\mu$ or along the optimal eigenvector $v_{j*}$. In
fact, GHS [139] find that the vectors $\mu$ and $v_{j*}$ nearly coincide, once normalized
to have the same length. They find similar patterns in other examples. This
phenomenon can occur when G is well approximated by a nonlinear function
of a linear combination of $z_1, \ldots, z_m$. For suppose $G(z) \approx g(v^\top z)$; then, the
gradient of G is nearly proportional to $v^\top$ and so $\mu$ will be nearly proportional
to v. Moreover, the Hessian of G will be nearly proportional to the rank-1
matrix $vv^\top$, whose only eigenvectors with nonzero eigenvalue are multiples of v. Thus, in this
setting, the optimal mean is proportional to the optimal eigenvector.


Application in the Heath-Jarrow-Morton Framework 


GHS [140] apply the combination of importance sampling and stratified sam- 
pling in the Heath-Jarrow-Morton framework. The complexity of this setting 
necessitates some approximations in the calculation of the optimal path and 
eigenvector to make the method computationally feasible. We comment on 
these briefly. 

We consider a three-factor model (so d = 3) discretized in time and matu- 
rity as detailed in Section 3.6.2. The discretization interval is a quarter of a 
year and we consider maturities up to 20 years, so m = 80 and the vector Z 
of random inputs has dimension n = md = 240. The factor loadings for each 
of the three factors are as displayed in Figure 4.13, where they are plotted 
against time to maturity. 


[Figure 4.13: curves for Factor 1, Factor 2, and Factor 3; vertical axis: factor loading; horizontal axis: time to maturity, with ticks from 10 to 80.]

Fig. 4.13. Factor loadings in three-factor HJM model used to illustrate importance
sampling.


As in Section 3.6.2, we use $f(t_i, t_j)$ to denote the forward rate at time $t_i$
for the interval $[t_j, t_{j+1}]$. We use an equally spaced time grid so $t_i = ih$ with
h a quarter of a year. The initial forward curve is

$$f(0, t_j) = \log(150 + 12j)/100, \quad j = 0, 1, \ldots, 80,$$

which increases gradually, approximately from 5% to 7%.

Among the examples considered in GHS [140] is the pricing of an interest
rate caplet maturing in $T = t_m$ years with a strike of K. The caplet pays

$$\max\left(0, \left(e^{f(t_m, t_m)h} - 1\right) - Kh\right)$$

at $t_{m+1}$. For a maturity of T = 5 years, m = 20 and the optimal drift $\mu$ is a
vector of dimension md = 60. We encode this vector in the following way: the
first 20 components give the drift as a function of time for the first factor (i.e.,
the first component of the underlying three-dimensional Brownian motion);
the next 20 components give the drift for the second factor; and the last 20
components give the drift for the third factor.

With this convention, the left panel of Figure 4.14 displays the optimal 
drift found through numerical solution of the optimization problem (4.90), 
with exp(F(z)) the discounted caplet payoff determined by the input vector 
z. This optimal path gives a positive drift to the first and third factors and a 
negative drift to the second; all three drifts approach zero toward the end of 
the 20 steps (the caplet maturity). 

The right panel of Figure 4.14 shows the net effect on the short rate of this
change of drift. (Recall from Section 3.6.2 that the short rate at $t_i$ is simply
$f(t_i, t_i)$.) If each of the three factors were to follow the mean paths in the left
panel of Figure 4.14, the short rate would follow the dashed line in the right 
panel of the figure. This should be compared with the initial forward curve, 
displayed as the solid line: without a change of drift, the central path of the 
short rate would roughly follow this forward curve. Thus, the net effect of the 
change of drift in the underlying factors is to push the short rate higher. This 
is to be expected because the caplet payoff increases with the short rate. But 
it is not obvious that the “optimal” way to push the short rate has the factors 
follow the paths in the left panel of Figure 4.14.

Fig. 4.14. Left panel shows optimal drift for factors in pricing a five-year caplet
struck at 7%. Right panel compares path of short rate produced by optimal drift
with the initial forward curve.

Table 4.6 compares the efficacy of four variance reduction techniques in
pricing caplets: antithetic variates, importance sampling using the optimal
drift, and two combinations of importance sampling with stratified sampling
using either the optimal drift $\mu$ or the optimal eigenvector $v_{j*}$ for the
stratification direction. Calculation of the optimal eigenvector requires numerical
calculation of the Hessian. The table (from [140]) displays estimated variance 
ratios based on 50,000 paths per method. The last two columns are based on 
100 strata. The table indicates that importance sampling is most effective for 
deep out-of-the-money caplets, precisely where antithetics become least effec- 
tive. Both strategies for stratification produce substantial additional variance 
reduction in these examples. 


                                           IS &            IS &
  T       K     Antithetics     IS     Strat. ($\mu$)   Strat. ($v_{j*}$)

 2.5    0.04         8           8          246            248
        0.07         1          16          510            444
        0.10         1         173         3067           2861
 5.0    0.04         4           8          188            211
        0.07         1          11          241            292
        0.10         1          27          475            912
10.0    0.04         4           7           52            141
        0.07         1           8           70            185
        0.10         1          12          110            244
15.0    0.04         4           3           15             67
        0.07         2           6           22            112
        0.10         1           8           31            158


Table 4.6. Estimated variance ratios for caplets in three-factor HJM model. 


Further numerical results for other interest rate derivatives are reported 
in GHS [140]. As in Table 4.6, the greatest variance reduction typically occurs 
for out-of-the-money options. The combination of importance sampling with 
stratification is less effective for options with discontinuous payoffs; a specific 
example of this in [140] is a “flex cap,” in which only a subset of the caplets 
in a cap make payments even if all expire in-the-money. 

In a complex, high-dimensional setting like an HJM model, the computa- 
tional overhead involved in finding the optimal path or the optimal eigenvector 
can become substantial. These are fixed costs, in the sense that these calcu- 
lations need only be done once for each pricing problem rather than once for 
each path simulated. In theory, as the number of paths increases (which is to 
say as the required precision becomes high), any fixed cost eventually becomes 
negligible, but this may not be relevant for practical sample sizes. 

GHS [140] develop approximations to reduce the computational overhead
in the optimization and eigenvector calculations. These approximations are
based on assuming that the optimal drift (or optimal eigenvector) is piecewise
linear between a relatively small number of nodes. This reduces the
dimension of the problem. Consider, for example, the computation of a
120-dimensional optimal drift consisting of three 40-dimensional segments.
Making a piecewise linear approximation to each segment based on four nodes per
segment reduces the problem to one of optimizing over the 12 nodes rather 
than all 120 components of the vector. Figure 4.15 illustrates the results of 
this approach in finding both the optimal drift and the optimal eigenvector for 
a ten-year caplet with a 7% strike. These calculations are explained in detail 
in GHS [140]. The figure suggests that the approach is quite effective. Simu- 
lation results reported in [140] based on using these approximations confirm 
the viability of the approach.

Fig. 4.15. Optimal drift (left) and eigenvector (right) calculated using a piecewise
linear approximation with three or four nodes per segment.
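
The dimension reduction behind this approximation can be sketched as follows. The sketch is ours and is not the procedure of GHS [140], whose details we do not reproduce: the function name `expand_nodes`, the equally spaced node positions, and the node values are all hypothetical. It shows only how 12 node values determine a 120-dimensional vector made of three 40-dimensional piecewise linear segments.

```python
import numpy as np

def expand_nodes(node_values, seg_len, node_pos=None):
    """Expand per-segment node values into a full piecewise linear vector.

    node_values has shape (n_segments, n_nodes); each row is interpolated
    linearly onto component indices 0, ..., seg_len - 1 of its segment.
    node_pos gives the node locations within a segment; equally spaced
    nodes covering the segment are assumed when it is omitted.
    """
    node_values = np.atleast_2d(np.asarray(node_values, dtype=float))
    n_nodes = node_values.shape[1]
    if node_pos is None:
        node_pos = np.linspace(0.0, seg_len - 1.0, n_nodes)
    grid = np.arange(seg_len, dtype=float)
    return np.concatenate([np.interp(grid, node_pos, row) for row in node_values])

# Hypothetical node values for a three-factor drift: 3 segments, 4 nodes each.
nodes = np.array([[0.30, 0.25, 0.15, 0.00],
                  [-0.10, -0.08, -0.04, 0.00],
                  [0.05, 0.04, 0.02, 0.00]])
mu = expand_nodes(nodes, seg_len=40)
# The 12 entries of `nodes` are the only free parameters of the 120 components.
print(mu.shape)
```

An optimizer for (4.90) would then search over the 12 entries of `nodes`, evaluating the objective at `expand_nodes(nodes, 40)`, rather than over all 120 components directly.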


4.7 Concluding Remarks 


It is easier to survey the topic of variance reduction than to answer the ques- 
tion that brings a reader to such a survey: “Which technique should I use?” 
There is rarely a simple answer to this question. The most effective applica- 
tions of variance reduction techniques take advantage of an understanding of 
the problem at hand. The choice of technique should depend on the avail- 
able information and on the time available to tailor a general technique to a 
particular problem. An understanding of the strengths and weaknesses of al- 
ternative methods and familiarity with examples of effective applications are 
useful in choosing a technique. 

Figure 4.16 provides a rough comparison of several techniques discussed in 
this chapter. The figure positions the methods according to their complexity 
and shows a range of typical effectiveness for each. We have not given the
axes units or scales, nor have we defined what they mean or what counts as
typical; the figure should clearly not be taken literally or too seriously. We
include it to highlight some differences among the methods in the types of 
implementations and applications discussed in this chapter. 


[Figure 4.16: schematic plot of effectiveness (vertical axis) against complexity (horizontal axis); the labels that survive here are Antithetics and Importance Sampling.]

Fig. 4.16. Schematic comparison of some variance reduction techniques


The effectiveness of a method is the efficiency improvement it brings in 
the sense discussed in Section 1.1.3. Effectiveness below the horizontal axis 
in the figure is detrimental — worse than not using any variance reduction 
technique. In the complexity of a method we include the level of effort and 
detailed knowledge required for implementation. We now explain the positions 
of the methods in the figure: 


o Antithetic sampling requires no specific information about a simulated 
model and is trivial to implement. It rarely provides much variance re- 
duction. It can be detrimental (as explained in Section 4.2), but such cases 
are not common. 

o Control variates and methods for matching underlying assets (including 
path adjustments and weighted Monte Carlo) give similar results. Of these 
methods, control variates are usually the easiest to implement and the best 
understood. Finding a good control requires knowing something about the 
simulated model — but not too much. A control variate implemented with 
an optimal coefficient is guaranteed not to increase variance. When the op- 
timal coefficient is estimated rather than known, it is theoretically possible 
for a control variate to be detrimental at very small sample sizes, but this 
is not a practical limitation of the method. 

o We have positioned stratified sampling at a higher level of complexity
because we have in mind examples like those in Section 4.3.2 where the
stratification variable is tailored to the model. In contrast, stratifying a uniform
distribution (as in Example 4.3.1) is trivial but not very effective by itself.
Using a variable for stratification requires knowing its distribution whereas 
using it as a control only requires knowing its mean. Stratified sampling is 
similar to using the distribution (more precisely, the stratum probabilities) 
as controls and is thus more powerful than using just the mean. With a 
proportional allocation, stratified sampling never increases variance; using 
any other allocation may be viewed as a form of importance sampling. 

o We have omitted Latin hypercube sampling from the figure. As a general- 
ization of stratified sampling, it should lie upwards and to the right of that 
method. But a generic application of Latin hypercube sampling (in which 
the marginals simply correspond to the uniform random variables used to 
drive a simulation) is often both easier to implement and less effective than 
a carefully designed one-dimensional stratification. A problem for which a 
generic application of Latin hypercube sampling is effective, is also a good 
candidate for the quasi-Monte Carlo methods discussed in Chapter 5. 

o As emphasized in the figure, importance sampling is the most delicate of 
the methods discussed in this chapter. It has the capacity to exploit detailed 
knowledge about a model (often in the form of asymptotic approximations) 
to produce orders of magnitude variance reduction. But if the importance 
sampling distribution is not chosen carefully, this method can also increase 
variance. Indeed, it can even produce infinite variance. 


There are counterexamples to nearly any general statement one could make 
comparing variance reduction techniques and one could argue against any of 
the comparisons implied by Figure 4.16. Nevertheless, we believe these com- 
parisons to be indicative of what one finds in applying variance reduction 
techniques to the types of models and problems that arise in financial
engineering.

We close this section with some further references to the literature on vari- 
ance reduction. Glynn and Iglehart [156] survey the application of variance 
reduction techniques in queueing simulations and discuss some techniques 
not covered in this chapter. Schmeiser, Taaffe, and Wang [318] analyze biased 
control variates with coefficients chosen to minimize root mean square error; 
this is relevant to, e.g., Example 4.1.3 and to settings in which the mean of 
a control variate is approximated using a binomial lattice. The conditional 
sampling methods of Cheng [81, 82] are relevant to the methods discussed 
in Sections 4.3.2 and 4.5.1. Hesterberg [177] and Owen and Zhou [292] pro- 
pose defensive forms of importance sampling that mix an aggressive change 
of distribution with a more conservative one to bound the worst-case vari- 
ance. Dupuis and Wang [108] show that a dynamic exponential change of 
measure — in which the twisting parameter is recomputed at each step — 
can outperform a static twist. An adaptive importance sampling method for 
Markov chains is shown in Kollman et al. [214] to converge exponentially fast. 
An importance sampling method for stochastic volatility models is developed
in Fournié, Lasry, and Touzi [125]. Schoenmakers and Heemink [319] apply
importance sampling for derivatives pricing through an approximating PDE.

Shahabuddin [327] uses rare-transition asymptotics to develop an impor- 
tance sampling procedure for reliability systems; see Heidelberger [175] and 
Shahabuddin [328] for more on this application area. Asmussen and
Binswanger [22], Asmussen, Binswanger, and Højgaard [23], and Juneja and
Shahabuddin [206] address difficulties in applying importance sampling to heavy-
tailed distributions; see also Section 9.3. 

Avramidis and Wilson [30] and Hesterberg and Nelson [178] analyze vari- 
ance reduction techniques for quantile estimation. We return to this topic in 
Chapter 9. 

Among the techniques not discussed in this chapter is conditional Monte 
Carlo, also called Rao-Blackwellization. This method replaces an estimator by 
its conditional expectation. Asmussen and Binswanger [22] give a particularly 
effective application of this idea to an insurance problem; Fox and Glynn [129] 
combine it with other techniques to estimate infinite horizon discounted costs; 
Boyle et al. [53] give some applications in option pricing. 


5 
Quasi-Monte Carlo 


This chapter discusses alternatives to Monte Carlo simulation known as quasi- 
Monte Carlo or low-discrepancy methods. These methods differ from ordinary 
Monte Carlo in that they make no attempt to mimic randomness. Indeed, 
they seek to increase accuracy specifically by generating points that are too 
evenly distributed to be random. Applying these methods to the pricing of 
derivative securities requires formulating a pricing problem as the calculation 
of an integral and thus suppressing its stochastic interpretation as an expected 
value. This contrasts with the variance reduction techniques of Chapter 4, 
which take advantage of the stochastic formulation to improve precision. 

Low-discrepancy methods have the potential to accelerate convergence
from the $O(1/\sqrt{n})$ rate associated with Monte Carlo (n the number of paths
or points generated) to nearly O(1/n) convergence: under appropriate
conditions, the error in a quasi-Monte Carlo approximation is $O(1/n^{1-\epsilon})$ for all
$\epsilon > 0$. Variance reduction techniques, affecting only the implicit constant in
$O(1/\sqrt{n})$, are not nearly so ambitious. We will see, however, that the $\epsilon$ in
$O(1/n^{1-\epsilon})$ hides a dependence on problem dimension.

The tools used to develop and analyze low-discrepancy methods are very 
different from those used in ordinary Monte Carlo, as they draw on number 
theory and abstract algebra rather than probability and statistics. Our goal 
is therefore to present key ideas and methods rather than an account of the 
underlying theory. Niederreiter [281] provides a thorough treatment of the
theory.


5.1 General Principles 


This section presents definitions and results from the theory of quasi-Monte
Carlo (QMC) methods. It is customary in this setting to focus on the problem
of numerical integration over the unit hypercube. Recall from Section 1.1 and
the many examples in Chapter 3 that each replication in a Monte Carlo
simulation can be interpreted as the result of applying a series of transformations
(implicit in the simulation algorithm) to an input sequence of independent
uniformly distributed random variables $U_1, U_2, \ldots$. Suppose there is an
upper bound d on the number of uniforms required to produce a simulation
output and let $f(U_1, \ldots, U_d)$ denote this output. For example, f may be the
result of transformations that convert the $U_i$ to normal random variables,
the normal random variables to paths of underlying assets, and the paths to
the discounted payoff of a derivative security. We suppose the objective is to
calculate

$$E[f(U_1, \ldots, U_d)] = \int_{[0,1)^d} f(x)\, dx. \tag{5.1}$$

Quasi-Monte Carlo approximates this integral using

$$\int_{[0,1)^d} f(x)\, dx \approx \frac{1}{n} \sum_{i=1}^{n} f(x_i), \tag{5.2}$$

for carefully (and deterministically) chosen points $x_1, \ldots, x_n$ in the unit
hypercube $[0, 1)^d$.
A few issues require comment: 


o The function f need not be available in any explicit form; we merely require 
a method for evaluating f, and this is what a simulation algorithm does. 

o Whether or not we include the boundary of the unit hypercube in (5.1) and 
(5.2) has no bearing on the value of the integral and is clearly irrelevant 
in ordinary Monte Carlo. But some of the definitions and results in QMC 
require care in specifying the set to which points on a boundary belong. It 
is convenient and standard to take intervals to be closed on the left and 
open on the right, hence our use of $[0, 1)^d$ as the unit hypercube.

o In ordinary Monte Carlo simulation, taking a scalar i.i.d. sequence of
uniforms $U_1, U_2, \ldots$ and forming vectors $(U_1, \ldots, U_d), (U_{d+1}, \ldots, U_{2d}), \ldots$
produces an i.i.d. sequence of points from the d-dimensional hypercube. In
QMC, the construction of the points $x_i$ depends explicitly on the dimension
of the problem: the vectors $x_i$ in $[0, 1)^d$ cannot be constructed by
taking sets of d consecutive elements from a scalar sequence.
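
The last point can be illustrated with a small experiment. As an assumption for illustration only, we use the base-2 radical inverse (van der Corput) sequence, a standard one-dimensional low-discrepancy sequence that this section has not introduced; the function name `radical_inverse_base2` is ours. Pairing consecutive elements of the scalar sequence, as one would with i.i.d. uniforms, produces two-dimensional points that all lie on a single line segment.

```python
def radical_inverse_base2(k):
    """Base-2 radical inverse of the integer k >= 0 (van der Corput sequence)."""
    x, scale = 0.0, 0.5
    while k > 0:
        if k & 1:
            x += scale
        k >>= 1
        scale *= 0.5
    return x

n_points = 64
scalar_seq = [radical_inverse_base2(k) for k in range(2 * n_points)]
# Naive construction: pair up consecutive elements of the scalar sequence.
pairs = [(scalar_seq[2 * i], scalar_seq[2 * i + 1]) for i in range(n_points)]
# Every such pair satisfies y = x + 1/2, so the points fall on one line segment.
gaps = {y - x for x, y in pairs}
print(gaps)  # {0.5}
```

The pattern is exact: the integers 2i and 2i + 1 differ only in their lowest binary digit, which the radical inverse maps to the leading digit of the fraction. Such points cover the unit square very poorly, which is why a two-dimensional point set has to be constructed as such.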


The dependence of QMC methods on problem dimension is one of the
features that most distinguishes them from Monte Carlo. If two different
Monte Carlo algorithms corresponding to functions $f : [0,1)^{d_1} \to \mathbb{R}$ and
$g : [0,1)^{d_2} \to \mathbb{R}$ resulted in $f(U_1, \ldots, U_{d_1})$ and $g(U_1, \ldots, U_{d_2})$ having the
same distribution, then these two algorithms would have the same bias and
variance properties. The preferred algorithm would be the one requiring less
time to evaluate; the dimensions $d_1, d_2$ would be irrelevant except to the
extent that they affect the computing times. In ordinary Monte Carlo one
rarely even bothers to think about problem dimension, whereas in QMC the
dimension must be identified explicitly before points can be generated.
Lower-dimensional representations generally result in smaller errors. For some Monte
Carlo algorithms, there is no upper bound d on the number of input uniforms
required per output; this is true, for example, of essentially all simulations
using acceptance-rejection methods, as noted in Section 2.2.2. Without an
upper bound d, QMC methods are inapplicable.

For the rest of this chapter, we restrict attention to problems with a finite 
dimension d and consider approximations of the form in (5.2). The goal of 
low-discrepancy methods is to construct points $x_i$ that make the error in
(5.2) small for a large class of integrands f. It is intuitively clear (and, as
we will see, correct in a precise sense) that this is equivalent to choosing the
points $x_i$ to fill the hypercube uniformly.


5.1.1 Discrepancy 


A natural first attempt at filling the hypercube uniformly would choose the
$x_i$ to lie on a grid. But grids suffer from several related shortcomings. If
the integrand f is nearly a separable function of its d arguments, the
information contained in the values of f at $n^d$ grid points is nearly the
same as the information in just nd of these values. A grid leaves large
rectangles within $[0,1)^d$ devoid of any points. A grid requires specifying
the total number of points n in advance. If one refines a grid by adding
points, the number of points that must be added to reach the next favorable
configuration grows very quickly. Consider, for example, a grid constructed as
the Cartesian product of $2^k$ points along each of d dimensions for a total of
$2^{kd}$ points. Now refine the grid by adding a point in each gap along each
dimension; i.e., by doubling the number of points along each dimension. The
total number of points added to the original grid to reach the new grid is
$2^{(k+1)d} - 2^{kd}$, which grows very quickly with k. In contrast, there are
low-discrepancy sequences with guarantees of uniformity over bounded-length
extensions of an initial segment of the sequence.
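
To make the growth concrete, the short Python sketch below evaluates the count $2^{(k+1)d} - 2^{kd}$ from the paragraph above for a few values of k and d. The function name `points_added_by_refinement` is introduced here only for illustration.

```python
def points_added_by_refinement(k: int, d: int) -> int:
    """Points needed to refine a grid from 2**k to 2**(k+1) points per axis in d dimensions."""
    return 2 ** ((k + 1) * d) - 2 ** (k * d)


for d in (1, 2, 5, 10):
    added = [points_added_by_refinement(k, d) for k in (1, 2, 3)]
    print(f"d = {d:2d}: points added for k = 1, 2, 3 -> {added}")
```

Since the count equals $2^{kd}(2^d - 1)$, each refinement multiplies the size of the existing grid by $2^d$; in ten dimensions a single refinement step already requires more than a thousand times as many points as the grid being refined.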

To make these ideas precise, we need a precise notion of uniformity, or
rather of deviation from uniformity, which we measure through various notions
of discrepancy. Given a collection $\mathcal{A}$ of (Lebesgue measurable)
subsets of $[0,1)^d$, the discrepancy of the point set $\{x_1,\ldots,x_n\}$
relative to $\mathcal{A}$ is

$$D(x_1,\ldots,x_n;\mathcal{A}) = \sup_{A\in\mathcal{A}} \left| \frac{\#\{x_i \in A\}}{n} - \mathrm{vol}(A) \right|. \tag{5.3}$$

Here, $\#\{x_i \in A\}$ denotes the number of $x_i$ contained in A and vol(A)
denotes the volume (measure) of A. Thus, the discrepancy is the supremum over
errors in integrating the indicator function of A using the points
$x_1,\ldots,x_n$. (In all interesting cases the $x_i$ are distinct points, but
to cover the possibility of duplication, count each point according to its
multiplicity in the definition of discrepancy.)

Taking $\mathcal{A}$ to be the collection of all rectangles in $[0,1)^d$ of the
form

$$\prod_{j=1}^{d} [u_j, v_j), \qquad 0 \le u_j < v_j \le 1,$$

yields the ordinary (or extreme) discrepancy $D(x_1,\ldots,x_n)$. Restricting
$\mathcal{A}$ to rectangles of the form

$$\prod_{j=1}^{d} [0, u_j)$$

defines the star discrepancy $D^*(x_1,\ldots,x_n)$. The star discrepancy is
obviously no larger than the ordinary discrepancy; Niederreiter [281],
Proposition 2.4, shows that

$$D^*(x_1,\ldots,x_n) \le D(x_1,\ldots,x_n) \le 2^d D^*(x_1,\ldots,x_n),$$

so for fixed d the two quantities have the same order of magnitude.

Requiring each of these discrepancy measures to be small is consistent with 
an intuitive notion of uniformity. However, both measures focus on products 
of intervals and ignore, for example, a rotated subcube of the unit hypercube. 
If the integrand f represents a simulation algorithm, the coordinate axes may 
not be particularly meaningful. The asymmetry of the star discrepancy may 
seem especially odd: in a Monte Carlo simulation, we could replace any uniform
input $U_i$ with $1 - U_i$ and thus interchange 0 and 1 along one coordinate.
If, as may seem more natural, we take $\mathcal{A}$ to be all convex subsets of $[0,1)^d$, we
get the isotropic discrepancy; but the magnitude of this measure can be as 
large as the dth root of the ordinary discrepancy (see p.17 of Niederreiter [281] 
and Chapter 3 of Matoušek [256]). We return to this point in Section 5.1.3. 

We will see in Section 5.1.3 that these notions of discrepancy are indeed 
relevant to measuring the approximation error in (5.2). It is therefore sensible 
to look for points that achieve low values of these discrepancy measures, and 
that is what low-discrepancy methods do. 

In dimension d = 1, Niederreiter [281], pp.23-24, shows that

$$D^*(x_1,\ldots,x_n) \ge \frac{1}{2n}, \qquad D(x_1,\ldots,x_n) \ge \frac{1}{n}, \tag{5.5}$$

and that in both cases the minimum is attained by

$$x_i = \frac{2i-1}{2n}, \qquad i = 1,\ldots,n. \tag{5.6}$$

For this set of points, (5.2) reduces to the midpoint rule for integration over
the unit interval. Notice that (5.6) does not define the first n points of an
infinite sequence; in fact, the set of points defined by (5.6) has no values in
common with the corresponding set for n + 1.
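
In one dimension the star discrepancy can be computed exactly, which makes the lower bound easy to check. The sketch below is an illustration rather than anything from the original text. It assumes the standard closed form for sorted points $x_{(1)} \le \cdots \le x_{(n)}$, namely $D^* = 1/(2n) + \max_i |x_{(i)} - (2i-1)/(2n)|$; the function names are chosen here for convenience.

```python
def star_discrepancy_1d(points):
    """Exact star discrepancy of points in [0, 1), using the sorted-sample closed form."""
    xs = sorted(points)
    n = len(xs)
    worst = max(abs(x - (2 * i - 1) / (2 * n)) for i, x in enumerate(xs, start=1))
    return 1.0 / (2 * n) + worst


def midpoint_set(n):
    """The points (2i - 1) / (2n), i = 1, ..., n."""
    return [(2 * i - 1) / (2 * n) for i in range(1, n + 1)]


n = 8
d_mid = star_discrepancy_1d(midpoint_set(n))            # attains the lower bound 1/(2n)
d_left = star_discrepancy_1d([i / n for i in range(n)])  # left endpoints of the same cells
print(d_mid, d_left)
```

Shifting every point from the midpoint of its cell to the left endpoint doubles the star discrepancy, from 1/(2n) to 1/n.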

Suppose, in contrast, that we fix an infinite sequence $x_1, x_2, \ldots$ of
points in [0, 1) and measure the discrepancy of the first n points. From the
perspective of numerical integration, this is a more relevant case if we hope
to be able to increase the number of points in an approximation of the form
(5.2). Niederreiter [281], p.24, cites references showing that in this case

$$D(x_1,\ldots,x_n) \ge D^*(x_1,\ldots,x_n) \ge \frac{c \log n}{n}$$

for infinitely many n, with c a constant. This situation is typical of
low-discrepancy methods, even in higher dimensions: one can generally achieve a
lower discrepancy by fixing the number of points n in advance; using the first
n points of a sequence rather than a different set of points for each n
typically increases discrepancy by a factor of log n.

Much less is known about the best possible discrepancy in dimensions
higher than 1. Niederreiter [281], p.32, states that “it is widely believed”
that in dimensions $d \ge 2$, any point set $x_1,\ldots,x_n$ satisfies

$$D^*(x_1,\ldots,x_n) \ge c_d \frac{(\log n)^{d-1}}{n}$$

and the first n elements of any sequence $x_1, x_2, \ldots$ satisfy

$$D^*(x_1,\ldots,x_n) \ge c'_d \frac{(\log n)^{d}}{n},$$

for constants $c_d$, $c'_d$ depending only on the dimension d. These
order-of-magnitude discrepancies are achieved by explicit constructions
(discussed in Section 5.2). It is therefore customary to reserve the informal
term “low-discrepancy” for methods that achieve a star discrepancy of
$O((\log n)^d/n)$. The logarithmic term can be absorbed into any power of n,
allowing the looser bound $O(1/n^{1-\epsilon})$, for all $\epsilon > 0$.

Although any power of logn eventually becomes negligible relative to n, 
this asymptotic property may not be relevant at practical values of n if d is 
large. Accordingly, QMC methods have traditionally been characterized as ap- 
propriate only for problems of moderately high dimension, with some authors 
putting the upper limit at 40 dimensions, others putting it as low as 12 or 15. 
But in many recent applications of QMC to problems in finance, these meth- 
ods have been found to be effective in much higher dimensions. We present 
some evidence of this in Section 5.5 and comment further in Section 5.6. 
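
A rough numerical comparison shows why the asymptotic rate can be misleading at practical sample sizes. The sketch below ignores all constants and simply evaluates the two orders of magnitude, $(\log n)^d/n$ and $1/\sqrt{n}$, at $n = 10^6$; the function names and the choice of n are assumptions made for this illustration only.

```python
import math


def qmc_rate(n: float, d: int) -> float:
    """The low-discrepancy order (log n)^d / n, with constants ignored."""
    return math.log(n) ** d / n


def mc_rate(n: float) -> float:
    """The ordinary Monte Carlo order 1 / sqrt(n), with constants ignored."""
    return 1.0 / math.sqrt(n)


n = 1e6
for d in (1, 2, 5, 10, 30):
    print(f"d = {d:2d}: (log n)^d / n = {qmc_rate(n, d):.3e}, 1/sqrt(n) = {mc_rate(n):.3e}")
```

With a million points the quantity $(\log n)^d/n$ is below $1/\sqrt{n}$ for d = 2 but exceeds 1 by d = 10, so the bound says nothing useful there. That QMC nevertheless works well in many high-dimensional finance problems indicates that these worst-case rates do not tell the whole story.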


5.1.2 Van der Corput Sequences 


Before proceeding with a development of further theoretical background, we 
introduce a specific class of one-dimensional low-discrepancy sequences called 
Van der Corput sequences. In addition to illustrating the general notion of 
discrepancy, this example provides the key element of many multidimensional 
constructions. 

By a base we mean an integer $b \ge 2$. Every positive integer k has a unique
representation (called its base-b or b-ary expansion) as a linear combination
of nonnegative powers of b with coefficients in $\{0,1,\ldots,b-1\}$. We can
write this as

$$k = \sum_{j=0}^{\infty} a_j(k)\, b^j, \tag{5.7}$$

with all but finitely many of the coefficients $a_j(k)$ equal to zero. The
radical inverse function $\psi_b$ maps each k to a point in [0, 1) by flipping
the coefficients of k about the base-b “decimal” point to get the base-b
fraction $.a_0 a_1 a_2 \ldots$. More precisely,

$$\psi_b(k) = \sum_{j=0}^{\infty} \frac{a_j(k)}{b^{j+1}}. \tag{5.8}$$

The base-b Van der Corput sequence is the sequence $0 = \psi_b(0), \psi_b(1),
\psi_b(2), \ldots$. Its calculation is illustrated in Table 5.1 for base 2.


    k    k in binary    $\psi_2(k)$ in binary    $\psi_2(k)$
    0    0              0                        0
    1    1              0.1                      1/2
    2    10             0.01                     1/4
    3    11             0.11                     3/4
    4    100            0.001                    1/8
    5    101            0.101                    5/8
    6    110            0.011                    3/8
    7    111            0.111                    7/8


Table 5.1. Illustration of radical inverse function $\psi_b$ in base b = 2.
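
The radical inverse is straightforward to compute digit by digit. The following Python sketch reproduces the last column of Table 5.1; it uses exact rational arithmetic so that the values print as fractions. The function names are choices made for this example and do not come from the text.

```python
from fractions import Fraction


def radical_inverse(k: int, b: int) -> Fraction:
    """psi_b(k): reflect the base-b digits of k about the radix point."""
    x, q = Fraction(0), Fraction(1, b)
    while k > 0:
        k, digit = divmod(k, b)   # peel off the least significant digit a_j(k)
        x += digit * q            # place it j + 1 positions after the point
        q /= b
    return x


def van_der_corput(n: int, b: int):
    """The first n elements psi_b(0), ..., psi_b(n - 1) of the base-b sequence."""
    return [radical_inverse(k, b) for k in range(n)]


print([str(x) for x in van_der_corput(8, 2)])
```

The same function works for any base $b \ge 2$; for example, `radical_inverse(5, 3)` is 7/9, because 5 is written 12 in base 3 and reflects to the base-3 fraction 0.21.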


Figure 5.1 illustrates how the base-2 Van der Corput sequence fills the 
unit interval. The kth row of the array in the figure shows the first k nonzero 
elements of the sequence; each row refines the previous one. The evolution of 
the point set is exemplified by the progression from the seventh row to the last 
row, in which the “sixteenths” are filled in. As these points are added, they 
appear on alternate sides of 1/2: first 1/16, then 9/16, then 5/16, and so on. 
Those that are added to the left of 1/2 appear on alternate sides of 1/4: first 
1/16, then 5/16, then 3/16, and finally 7/16. Those on the right side of 1/2 
similarly alternate between the left and right sides of 3/4. Thus, while a naive 
refinement might simply insert the new values in increasing order 1/16, 3/16, 
5/16,..., 15/16, the Van der Corput sequence inserts them in a maximally
balanced way.

The effect of the size of the base b can be seen by comparing Figure 5.1 
with the Van der Corput sequence in base 16. The first 15 nonzero elements 
of this sequence are precisely the values appearing in the last row of Figure 5.1,
but now they appear in increasing order. The first seven values of the base-16 
sequence are all between 0 and 1/2, whereas those of the base-2 sequence are 
spread uniformly over the unit interval. The larger the base, the greater the 
number of points required to achieve uniformity.

[Figure 5.1: a triangular array of fractions with denominators 2, 4, 8, and 16,
one row for each k = 1, ..., 15, with the entries of each row listed in
increasing order.]

Fig. 5.1. Illustration of the Van der Corput sequence in base 2. The kth row of
this array shows the first k nonzero elements of the sequence.


Theorem 3.6 of Niederreiter [281] shows that all Van der Corput sequences 
are low-discrepancy sequences. More precisely, the star discrepancy of the 
first n elements of a Van der Corput sequence is O(log n/n), with an implicit 
constant depending on the base b. 


5.1.3 The Koksma-Hlawka Bound 


In addition to their intuitive appeal as indicators of uniformity, discrepancy 
measures play a central role in bounding the error in the approximation (5.2). 
The key result in this direction is generally known as the Koksma-Hlawka 
inequality after a one-dimensional result published by Jurjen Koksma in 1942 
and its generalization by Edmund Hlawka in 1961. This result bounds the 
integration error in (5.2) by the product of two quantities, one depending 
only on the integrand f, the other (the star discrepancy of $x_1,\ldots,x_n$)
depending only on the point set used.


Finite Variation 


The bound depends on the integrand f through its Hardy-Krause variation,
which we now define, following Niederreiter [281]. For this we need f defined
(and finite) on the closed unit hypercube $[0,1]^d$. Consider a rectangle of
the form

$$J = [u_1^-, u_1^+] \times [u_2^-, u_2^+] \times \cdots \times [u_d^-, u_d^+],$$

with $0 \le u_i^- \le u_i^+ \le 1$, $i = 1,\ldots,d$. Each vertex of J has
coordinates of the form $u_i^{\pm}$. Let E(J) be the set of vertices of J with
an even number of + superscripts and let O(J) contain those with an odd number
of + superscripts. Define

$$\Delta(f;J) = \sum_{u \in E(J)} f(u) - \sum_{u \in O(J)} f(u);$$

this is the sum of f(u) over the vertices of J with function values at adjacent
vertices given opposite signs.

The unit hypercube can be partitioned into a set $\mathcal{P}$ of rectangles of
the form of J. Letting $\mathcal{P}$ range over all such partitions, define

$$V^{(d)}(f) = \sup_{\mathcal{P}} \sum_{J \in \mathcal{P}} |\Delta(f;J)|.$$

This is a measure of the variation of f. Niederreiter [281, p.19] notes that

$$V^{(d)}(f) = \int_0^1 \cdots \int_0^1 \left| \frac{\partial^d f}{\partial u_1 \cdots \partial u_d} \right| du_1 \cdots du_d,$$

if the partial derivative is continuous over $[0,1]^d$. This expression makes
the interpretation of $V^{(d)}$ more transparent, but it should be stressed
that f need not be differentiable for $V^{(d)}(f)$ to be finite.

For any $1 \le k \le d$ and any $1 \le i_1 < i_2 < \cdots < i_k \le d$,
consider the function on $[0,1]^k$ defined by restricting f to points
$(u_1,\ldots,u_d)$ with $u_j = 1$ if $j \notin \{i_1,\ldots,i_k\}$ and
$(u_{i_1},\ldots,u_{i_k})$ ranging over all of $[0,1]^k$. Denote by
$V^{(k)}(f; i_1,\ldots,i_k)$ the application of $V^{(k)}$ to this function.
Finally, define

$$V(f) = \sum_{k=1}^{d} \; \sum_{1 \le i_1 < \cdots < i_k \le d} V^{(k)}(f; i_1,\ldots,i_k). \tag{5.9}$$

This is the variation of f in the sense of Hardy and Krause.
We can now state the Koksma-Hlawka bound: if the function f has finite
Hardy-Krause variation V(f), then for any $x_1,\ldots,x_n \in [0,1)^d$,

$$\left| \frac{1}{n}\sum_{i=1}^{n} f(x_i) - \int_{[0,1)^d} f(u)\,du \right| \le V(f)\, D^*(x_1,\ldots,x_n). \tag{5.10}$$

As promised, this result bounds the integration error through the product of 
two terms. The first term is a measure of the variation of the integrand; the 
second term is a measure of the deviation from uniformity of the points at 
which the integrand is evaluated. 

Theorem 2.12 of Niederreiter [281] shows that (5.10) is a tight bound in
the sense that for each $x_1,\ldots,x_n$ and $\epsilon > 0$ there is a function
f for which the error on the left comes within $\epsilon$ of the bound on the
right. The function can be chosen to be infinitely differentiable, so in this
sense (5.10) is tight even for very smooth functions.
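
In one dimension both sides of (5.10) can be computed, which gives a simple sanity check of the inequality. The sketch below is an illustration under stated assumptions: the integrand $f(u) = u^2$ is chosen here because its integral (1/3) and its variation (the integral of $|f'|$ over the unit interval, which equals 1) are known exactly; the points are the first 64 elements of the base-2 Van der Corput sequence; and the star discrepancy uses the standard closed form for sorted points in one dimension. All function names are hypothetical.

```python
def radical_inverse(k: int, b: int = 2) -> float:
    """Base-b radical inverse of k."""
    x, q = 0.0, 1.0 / b
    while k > 0:
        k, digit = divmod(k, b)
        x += digit * q
        q /= b
    return x


def star_discrepancy_1d(points):
    """Exact one-dimensional star discrepancy via the sorted-sample closed form."""
    xs = sorted(points)
    n = len(xs)
    return 1.0 / (2 * n) + max(
        abs(x - (2 * i - 1) / (2 * n)) for i, x in enumerate(xs, start=1)
    )


def f(u: float) -> float:
    return u * u          # integral over [0, 1] is 1/3; variation is 1


n = 64
xs = [radical_inverse(k) for k in range(n)]
error = abs(sum(f(x) for x in xs) / n - 1.0 / 3.0)   # left side of (5.10)
bound = 1.0 * star_discrepancy_1d(xs)                # V(f) * D*, right side
print(error, bound)
```

For this smooth integrand the actual error is roughly half the bound. In one dimension, then, the inequality is not grossly pessimistic; the criticisms discussed below concern mainly higher dimensions and less regular integrands.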

It is natural to contrast the Koksma-Hlawka inequality with the error
information available in ordinary Monte Carlo. To this end, let $U, U_1, U_2,
\ldots$ be independent and uniformly distributed over the d-dimensional unit
hypercube and let $\sigma_f^2 = \mathrm{Var}[f(U)]$. From the central limit
theorem, we know that

$$\left| \frac{1}{n}\sum_{i=1}^{n} f(U_i) - \int_{[0,1)^d} f(u)\,du \right| \le z_{\delta/2} \frac{\sigma_f}{\sqrt{n}} \tag{5.11}$$

with probability approximately equal to $1 - \delta$, with $-z_{\delta/2}$ the
$\delta/2$ quantile of the standard normal distribution. From Chebyshev's
inequality we know that for any $\delta > 0$,

$$\left| \frac{1}{n}\sum_{i=1}^{n} f(U_i) - \int_{[0,1)^d} f(u)\,du \right| \le \frac{\sigma_f}{\sqrt{\delta n}} \tag{5.12}$$

with probability at least $1 - \delta$. The following observations are relevant
in comparing the quasi-Monte Carlo and ordinary Monte Carlo error information:


o The Koksma-Hlawka inequality (5.10) provides a strict bound on the
integration error, whereas (5.11) and (5.12) are probabilistic bounds and (5.11)
requires n to be large. In both (5.11) and (5.12) we may, however, choose
$\delta > 0$ to bring the probability $1 - \delta$ arbitrarily close to 1.

o Both of the terms V(f) and $D^*(x_1,\ldots,x_n)$ appearing in the
Koksma-Hlawka inequality are difficult to compute, potentially much more so
than the integral of f. In contrast, the unknown parameter $\sigma_f$ in (5.11)
and (5.12) is easily estimated from $f(U_1),\ldots,f(U_n)$ with negligible
additional computation.

o In cases where V(f) and $D^*(x_1,\ldots,x_n)$ are known, the Koksma-Hlawka
bound is often found to grossly overestimate the true error of integration.
In contrast, the central limit theorem typically provides a sound and
informative measure of the error in a Monte Carlo estimate.

o The condition that V (f) be finite is restrictive. It requires, for example, that 
f be bounded, a condition often violated in option pricing applications. 
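
The second observation is easy to illustrate: both Monte Carlo half-widths follow from the sample standard deviation at essentially no extra cost. The Python sketch below is an assumption-laden illustration, not the book's procedure. The integrand $f(u) = e^u$ (whose integral over the unit interval is $e - 1$), the sample size, the seed, and the function name `mc_error_bounds` are all choices made here.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def mc_error_bounds(samples, delta):
    """Half-widths from (5.11) (CLT) and (5.12) (Chebyshev), with sigma_f estimated."""
    n = len(samples)
    sigma_hat = stdev(samples)                      # estimate of sigma_f
    z = NormalDist().inv_cdf(1.0 - delta / 2.0)     # z_{delta/2}
    clt = z * sigma_hat / math.sqrt(n)
    chebyshev = sigma_hat / math.sqrt(delta * n)
    return clt, chebyshev


rng = random.Random(0)
samples = [math.exp(rng.random()) for _ in range(10_000)]
estimate = mean(samples)
clt, chebyshev = mc_error_bounds(samples, delta=0.05)
print(f"estimate = {estimate:.5f}, CLT half-width = {clt:.5f}, "
      f"Chebyshev half-width = {chebyshev:.5f}")
```

At $\delta = 0.05$ the Chebyshev half-width is more than twice the CLT half-width, since $1/\sqrt{0.05} \approx 4.47$ while $z_{0.025} \approx 1.96$. This is the price of a bound that holds for every n rather than only asymptotically.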


In light of these observations, it seems fair to say that despite its theoret- 
ical importance the Koksma-Hlawka inequality has limited applicability as a 
practical error bound. This is a shortcoming of quasi-Monte Carlo methods in 
comparison to ordinary Monte Carlo methods, for which effective error infor- 
mation is readily available. The most important consequence of the Koksma- 
Hlawka inequality is that it helps guide the search for effective point sets and 
sequences by making precise the role of discrepancy. 

The Koksma-Hlawka inequality is the best-known example of a set of related
results. Hickernell [181] generalizes the inequality by extending both the
star discrepancy and the Hardy-Krause variation using more general norms.
One such bound uses an $L_2$ discrepancy defined by replacing the maximum
absolute deviation in (5.3) with a root-mean-square deviation. An analog of
(5.10) then holds for this notion of discrepancy, with V(f) replaced by an
$L_2$ notion of variation. An advantage of the $L_2$ discrepancy is that it is
comparatively easy to calculate (through simulation, for example).

Integrands arising in derivative pricing applications sometimes vanish off 
a subset of $[0,1)^d$ (once formulated as functions on the hypercube) and may
be discontinuous at the boundary of this domain. This is typical of barrier op- 
tions, for example. Such integrands usually have infinite variation, as explained 
in Figure 5.2. The Koksma-Hlawka inequality is therefore uninformative for 
a large class of interesting integrands. An important variant of (5.10), one 
of several cited in Niederreiter [281], p.21, applies to integrals over arbitrary 
convex subsets of the hypercube. Thus, if our integrand would have finite vari- 
ation but for the presence of the indicator function of a convex set, this result 
allows us to absorb the indicator into the integration domain and obtain a 
bound on the integration error. However, the bound in this case involves the 
isotropic discrepancy which, as we noted in Section 5.1.1, exhibits a much 
stronger dependence on dimension. 

This points to another limitation of the Koksma-Hlawka bound, at least 
from the perspective of our intended application. The Koksma-Hlawka result 
is oriented to the axes of the hypercube, through the definitions of both V(f) 
and the star discrepancy. The indicator of a rectangle, for example, has finite 
variation if the rectangle is parallel to the axes but infinite variation if the rec- 
tangle is rotated, as illustrated in Figure 5.2. This focus on a particular choice 
of coordinates seems unnatural if the function f is the result of transform- 
ing a simulation algorithm into a function on the unit hypercube; dropping 
this focus leads to much larger error bounds with a qualitatively different 
dependence on dimension (cf. Matoušek [257]). In Section 5.5.2, we discuss
applications of QMC methods that take account of more specific features of 
integrands f arising in derivative pricing applications. 


5.1.4 Nets and Sequences 


Despite its possible shortcomings as a practical bound on integration error, 
the Koksma-Hlawka inequality (5.10) nevertheless suggests that constructing 
point sets and sequences with low discrepancy is a fruitful approach for numer- 
ical integration. A valuable tool for constructing and describing such point sets 
is the notion of a (t,m,d)-net and a (t, d)-sequence introduced by Niederre- 
iter [280], extending ideas developed in base 2 by Sobol’ [335]. These are more 
commonly referred to as (t, m, s)-nets and (t, s)-sequences; the parameter s in 
this terminology refers to the dimension, for which we have consistently used 
d. Briefly, a (t, m, d)-net is a finite set of points in $[0,1)^d$ possessing a
degree of uniformity quantified by t; a (t, d)-sequence is a sequence of points
certain segments of which form (t, m, d)-nets.

Fig. 5.2. Variation of the indicator function of the shaded square. In the left
panel, each small box has a $\Delta(f;J)$ value of zero except the one
containing the corner of the shaded square; the variation remains finite. In
the right panel, each small box has a $|\Delta(f;J)|$ value of 1, except the
one on the corner. Because the boxes can be made arbitrarily small, the
indicator function on the right has infinite variation.


To formulate the definitions of these sets and sequences, we first need to
define a b-ary box, also called an elementary interval in base b, with $b \ge 2$
an integer. This is a subset of $[0,1)^d$ of the form

$$\prod_{i=1}^{d} \left[ \frac{a_i}{b^{j_i}}, \frac{a_i + 1}{b^{j_i}} \right),$$

with $j_i \in \{0,1,\ldots\}$ and $a_i \in \{0,1,\ldots,b^{j_i}-1\}$. The
vertices of a b-ary box thus have coordinates that are multiples of powers of
1/b, but with restrictions. In base 2, for example, [3/4, 1) and [3/4, 7/8) are
admissible but [5/8, 7/8) is not. The volume of a b-ary box is
$1/b^{j_1+\cdots+j_d}$.

For integers $0 \le t \le m$, a (t, m, d)-net in base b is a set of $b^m$ points
in $[0,1)^d$ with the property that exactly $b^t$ points fall in each b-ary box
of volume $b^{t-m}$. Thus, the net correctly estimates the volume of each such
b-ary box in the sense that the fraction of points $b^t/b^m$ that lie in the
box equals the volume of the box.

A sequence of points $x_1, x_2, \ldots$ in $[0,1)^d$ is a (t, d)-sequence in
base b if for all $m > t$ each segment $\{x_i : jb^m < i \le (j+1)b^m\}$,
$j = 0, 1, \ldots$, is a (t, m, d)-net in base b.
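
The net property can be verified by brute force for small point sets. The sketch below is an illustration, not an algorithm from the text: it enumerates every shape of b-ary box with volume $b^{t-m}$ and counts the points in each box. The two test sets are assumptions of this example: a two-dimensional set of the form $(k/16, \psi_2(k))$, with $\psi_2$ the radical inverse of Section 5.1.2, and a regular 4 x 4 grid. All function names are hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def radical_inverse(k: int, b: int) -> Fraction:
    x, q = Fraction(0), Fraction(1, b)
    while k > 0:
        k, digit = divmod(k, b)
        x += digit * q
        q /= b
    return x


def compositions(total: int, parts: int):
    """All tuples of `parts` nonnegative integers summing to `total`."""
    if parts == 1:
        yield (total,)
        return
    for first in range(total + 1):
        for rest in compositions(total - first, parts - 1):
            yield (first,) + rest


def is_net(points, t: int, m: int, b: int) -> bool:
    """Brute-force check of the (t, m, d)-net property in base b."""
    d = len(points[0])
    if len(points) != b ** m:
        return False
    for js in compositions(m - t, d):        # box shapes with volume b^(t-m)
        # Index of the box containing each point: floor(x_i * b^(j_i)) per coordinate.
        counts = Counter(tuple(int(x * b ** j) for x, j in zip(p, js)) for p in points)
        # b^m points with exactly b^t per occupied box forces every box to be occupied.
        if any(c != b ** t for c in counts.values()):
            return False
    return True


hammersley = [(Fraction(k, 16), radical_inverse(k, 2)) for k in range(16)]
grid = [(Fraction(i, 4), Fraction(j, 4)) for i, j in product(range(4), repeat=2)]

print(is_net(hammersley, 0, 4, 2))   # every 2-ary box of area 1/16 holds one point
print(is_net(grid, 0, 4, 2), is_net(grid, 2, 4, 2))
```

The 16-point grid passes the test only with t = 2: it is uniform over boxes of area 1/4 but leaves thin boxes such as [1/16, 1/8) x [0, 1) empty, whereas the first set places exactly one point in every 2-ary box of area 1/16.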

In these definitions, it should be evident that smaller values of t are asso- 
ciated with greater uniformity: with smaller t, even small b-ary boxes contain 
the right number of points. It should also be clear that, other things be- 
ing equal, a smaller base b is preferable because the uniformity properties 
of (t, m, d)-nets and (t, d)-sequences are exhibited in sets of $b^m$ points. With
larger b, more points are required for these properties to hold. 

Figure 5.3 displays two nets. The 81 ($= 3^4$) points in the left panel
comprise a (0, 4, 2)-net in base 3. Dotted lines in the figure show 3-ary boxes
with dimensions 1/9 x 1/9 and 1/27 x 1/3 containing one point each, as they
must. (For points on the boundaries, recall our convention that intervals are
closed on the left and open on the right.) The right panel shows a
(1, 7, 2)-net in base 2 (with $2^7 = 128$ points) that is not a (0, 7, 2)-net.
The dotted lines in the figure show that 2-ary boxes with area 1/64 contain two
points, but they also show boxes with dimensions 1/16 x 1/8 that do not contain
any points.


Fig. 5.3. Left panel shows 81 points comprising a (0, 4, 2)-net in base 3.
Right panel shows 128 points comprising a (1, 7, 2)-net in base 2. Both include
a point at the origin.


Niederreiter [281] contains an extensive analysis of discrepancy bounds 
for (t, m, d)-nets and (t, d)-sequences. Of his many results we quote just one, 
demonstrating that (t,d)-sequences are indeed low-discrepancy sequences. 
More precisely, Theorem 4.17 of Niederreiter [281] states that if
$x_1, x_2, \ldots$ is a (t, d)-sequence in base b, then for $n \ge 2$,

$$D^*(x_1,\ldots,x_n) \le C(d,b)\, b^t \frac{(\log n)^d}{n} + O\!\left( \frac{b^t (\log n)^{d-1}}{n} \right). \tag{5.13}$$

The factor C(d, b) (for which Niederreiter provides an explicit expression) and
the implicit constant in the $O(\cdot)$ term do not depend on n or t. Theorem
4.10 of Niederreiter [281] provides a similar bound for (t, m, d)-nets, but
with each exponent of log n reduced by 1.

In the next section, we describe several specific constructions of low- 
discrepancy sequences. The simplest constructions, producing Halton se- 
quences and Hammersley points, are the easiest to introduce, but they yield 
neither (t, d)-sequences nor (t, m, d)-nets. Faure sequences are (0, d)-sequences 
and thus optimize the uniformity parameter t; however, they require a base 
at least as large as the smallest prime greater than or equal to the dimension 
d. Sobol’ sequences use base 2 regardless of the dimension (which has
computational as well as uniformity advantages) but their t parameter grows
with the dimension d.


5.2 Low-Discrepancy Sequences


We turn now to specific constructions of low-discrepancy sequences in arbi- 
trary dimension d. We provide algorithms for the methods we consider and 
make some observations on the properties and relative merits of various se- 
quences. All methods discussed in this section build on the Van der Corput 
sequences discussed in Section 5.1.2. 


5.2.1 Halton and Hammersley 


Halton [165], extending work of Hammersley [168], provides the simplest
construction and first analysis of low-discrepancy sequences in arbitrary
dimension d. The coordinates of a Halton sequence follow Van der Corput
sequences in distinct bases. Thus, let $b_1,\ldots,b_d$ be relatively prime
integers greater than 1, and set

$$x_k = (\psi_{b_1}(k), \psi_{b_2}(k), \ldots, \psi_{b_d}(k)), \qquad k = 0, 1, 2, \ldots, \tag{5.14}$$

with $\psi_b$ the radical inverse function defined in (5.8).

The requirement that the $b_i$ be relatively prime is necessary for the
sequence to fill the hypercube. For example, the two-dimensional sequence
defined by $b_1 = 2$ and $b_2 = 6$ has no points in [0, 1/2) x [5/6, 1). Because
we prefer smaller bases to larger bases, we therefore take $b_1,\ldots,b_d$ to
be the first d prime numbers. The two-dimensional case, using bases 2 and 3, is
illustrated in Figure 5.4. With the convention that intervals are closed on the
left and open on the right, each cell in the figure contains exactly one point.


     k    $\psi_2(k)$    $\psi_3(k)$
     0    0              0
     1    1/2            1/3
     2    1/4            2/3
     3    3/4            1/9
     4    1/8            4/9
     5    5/8            7/9
     6    3/8            2/9
     7    7/8            5/9
     8    1/16           8/9
     9    9/16           1/27
    10    5/16           10/27
    11    13/16          19/27


Fig. 5.4. First twelve points of two-dimensional Halton sequence.
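
The construction in (5.14) takes only a few lines of code. The following Python sketch reproduces the values listed with Figure 5.4 and checks the earlier remark about bases that are not relatively prime; the function names and the number of points examined (2000) are choices made for this illustration.

```python
from fractions import Fraction


def radical_inverse(k: int, b: int) -> Fraction:
    x, q = Fraction(0), Fraction(1, b)
    while k > 0:
        k, digit = divmod(k, b)
        x += digit * q
        q /= b
    return x


def halton_point(k: int, bases):
    """The point x_k of (5.14): one radical inverse per coordinate."""
    return tuple(radical_inverse(k, b) for b in bases)


first_twelve = [halton_point(k, (2, 3)) for k in range(12)]
for k, (u, v) in enumerate(first_twelve):
    print(k, u, v)

# Bases 2 and 6 share the factor 2: psi_2(k) < 1/2 requires k even, while
# psi_6(k) >= 5/6 requires k = 5 (mod 6), which is odd.
bad = [halton_point(k, (2, 6)) for k in range(2000)]
cell_is_empty = not any(u < Fraction(1, 2) and v >= Fraction(5, 6) for u, v in bad)
print(cell_is_empty)
```

The emptiness of the cell [0, 1/2) x [5/6, 1) does not depend on the number of points examined: the parity argument in the comment applies to every k.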


A word about zero: Some properties of low-discrepancy sequences are most
conveniently stated by including a 0th point, typically zero itself, as in
(5.14). When the points are fed into a simulation algorithm, there is often
good reason to avoid zero; for example, $\Phi^{-1}(0) = -\infty$. In practice,
we therefore omit it. Depending on whether or not $x_0$ is included, $x_k$ is
either the kth or (k + 1)th point in the sequence, but we always take $x_k$ to
be the point constructed from the integer k, as in (5.14). Omission of $x_0$
has no bearing on asymptotic properties.

Halton points form an infinite sequence. We can achieve slightly better
uniformity if we are willing to fix the number of points n in advance. The n
points

$$\{ (k/n, \psi_{b_1}(k), \ldots, \psi_{b_{d-1}}(k)), \quad k = 0, 1, \ldots, n-1 \}$$

with relatively prime $b_1,\ldots,b_{d-1}$ form a Hammersley point set in
dimension d.

The star discrepancy of the first n Halton points in dimension d with
relatively prime bases $b_1,\ldots,b_d$ satisfies

$$D^*(x_0,\ldots,x_{n-1}) \le C_d(b_1,\ldots,b_d) \frac{(\log n)^d}{n} + O\!\left( \frac{(\log n)^{d-1}}{n} \right),$$

with $C_d(b_1,\ldots,b_d)$ independent of n; thus, Halton sequences are indeed
low-discrepancy sequences. The corresponding n-element Hammersley point set
satisfies

$$D^*(x_0,\ldots,x_{n-1}) \le C_{d-1}(b_1,\ldots,b_{d-1}) \frac{(\log n)^{d-1}}{n} + O\!\left( \frac{(\log n)^{d-2}}{n} \right).$$

The leading orders of magnitude in these bounds were established in Halton
[165] and subsequently refined through work reviewed in Niederreiter [281,
p.44].

A formula for $C_d(b_1,\ldots,b_d)$ is given in Niederreiter [281]. This upper
bound is minimized by taking the bases to be the first d primes. With $C_d$
denoting this minimizing value, Niederreiter [281], p.47, observes that

$$\lim_{d\to\infty} \frac{\log C_d}{d \log d} = 1,$$

so the bounding constant $C_d$ grows superexponentially. This indicates that
while the Halton and Hammersley points exhibit good uniformity for fixed d
as n increases, their quality degrades rapidly as d increases.

The deterioration of the Halton sequence and Hammersley points in high 
dimensions follows from the behavior of the Van der Corput sequence with 
a large base. The Van der Corput sequence in base b consists of consecutive 
monotone segments of length b. If the base is large, the sequence produces long 
monotone segments, and projections of a Halton sequence onto coordinates 
using large bases will have long diagonal segments in the projected hypercube. 

This pattern is illustrated in Figure 5.5, which shows two projections of
the first 1000 nonzero points of the Halton sequence in dimension 30. The left
panel is the projection onto the first two coordinates, which use bases 2 and
3; the right panel is the projection onto the last two coordinates, which use
bases 109 and 113, the 29th and 30th prime numbers. The impact of increasing
the bases, and thus also of increasing the dimension, is evident from the
figure.


Fig. 5.5. First 1000 points of the Halton sequence in dimension 30. Left panel shows 
projection onto first two coordinates (bases 2 and 3); right panel shows projection 
onto last two coordinates (bases 109 and 113). 


As a possible remedy for the problem illustrated in Figure 5.5, Kocis and
Whiten [213] suggest using a leaped Halton sequence

$$x_k = (\psi_{b_1}(k\ell), \psi_{b_2}(k\ell), \ldots, \psi_{b_d}(k\ell)), \qquad k = 0, 1, 2, \ldots,$$

for some integer $\ell \ge 2$. They recommend choosing $\ell$ to be relatively
prime to the bases $b_1,\ldots,b_d$.

This idea is illustrated in Figure 5.6, where we have applied it to a
two-dimensional Halton sequence with bases 109 and 113, the same bases used
in the right panel of Figure 5.5. Each panel of Figure 5.6 shows 1000 points,
using leaps $\ell = 3$, $\ell = 107$ (the prime that precedes 109), and
$\ell = 127$ (the prime that succeeds 113). The figures suggest that leaping
can indeed improve uniformity, but also that its effect is very sensitive to
the choice of leap parameter $\ell$.

The decline in uniformity of Halton sequences with increasing dimension is 
inherent to their construction. Several studies have concluded through numeri- 
cal experiments that Halton sequences are not competitive with other methods 
in high dimensions; these studies include Fox [126], Kocis and Whiten [213], 
and, in financial applications, Boyle et al. [53] and Paskov [295]. An excep- 
tion is Morokoff and Caflisch [272], where Halton sequences are found to be 
effective on a set of test problems; see also the comments of Matoušek [257,
p.543] supporting randomized Halton sequences.

Fig. 5.6. First 1000 points of leaped Halton sequence with bases 109 and 113.
From left to right, the leap parameters are $\ell = 3$, $\ell = 107$, and
$\ell = 127$.

Implementation 


Generating Halton points is essentially equivalent to generating a Van der 
Corput sequence, which in turn requires little more than finding base-b ex- 
pansions. We detail these steps because they will be useful later as well. 

Figure 5.7 displays an algorithm to compute the base-b expansion of an
integer $k \ge 0$. The function returns an array a whose elements are the
coefficients of the expansion ordered from most significant to least. Thus,
B-ARY(6,2) is (1,1,0) and B-ARY(135,10) is (1,3,5).


    B-ARY(k, b)
        a ← 0
        if (k > 0)
            jmax ← ⌊log(k)/log(b)⌋
            a ← (0, 0, ..., 0)   [length jmax + 1]
            q ← b^jmax
            for j = 1, ..., jmax + 1
                a(j) ← ⌊k/q⌋
                k ← k − q * a(j)
                q ← q/b
        return a


Fig. 5.7. Function B-ARY(k,b) returns coefficients of base-b expansion of integer k
in array a. Rightmost element of a is least significant digit in the expansion.

To generate elements $x_{n_1}, x_{n_1+1}, \ldots, x_{n_2}$ of the Van der
Corput sequence in base b, we need the expansions of $k = n_1, n_1 + 1, \ldots,
n_2$. But computing all of these through calls to B-ARY is wasteful. Instead,
we can use B-ARY to expand $n_1$ and then update the expansion recursively. A
function NEXTB-ARY that increments a base-b expansion by 1 is displayed in
Figure 5.8.


    NEXTB-ARY(a_in, b)
        m ← length(a_in), carry ← TRUE
        for i = m, ..., 1
            if carry
                if (a_in(i) = b − 1)
                    a_out(i) ← 0
                else
                    a_out(i) ← a_in(i) + 1
                    carry ← FALSE
            else
                a_out(i) ← a_in(i)
        if carry a_out ← (1, a_out(1), ..., a_out(m))
        return a_out


Fig. 5.8. Function NEXTB-ARY(a_in, b) returns coefficients of base-b expansion of
integer k + 1 in array a_out, given coefficients for integer k in array a_in.


Elements $x_{n_1}, \ldots, x_{n_2}$ of the base-b Van der Corput sequence can now be
calculated by making an initial call to B-ARY($n_1$, b) to get the coefficients of
the expansion of $n_1$ and then repeatedly applying NEXTB-ARY to get subsequent
coefficients. If the array a for a point n has m elements, we compute
$x_n$ as follows:

    x_n ← 0, q ← 1/b
    for j = 1, ..., m
        x_n ← x_n + q * a(m − j + 1)
        q ← q/b

This evaluates the radical-inverse function $\psi_b$. By applying the same procedure
with prime bases $b_1, \ldots, b_d$, we construct Halton points in dimension d.
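
The radical-inverse evaluation and its use for Halton points can be sketched in Python as below. The function names are hypothetical, and exact rational arithmetic (fractions.Fraction) is an assumption made here only so that the results can be compared with hand calculations; a production implementation would normally use floating point.

```python
from fractions import Fraction

def radical_inverse(k: int, b: int) -> Fraction:
    """psi_b(k): reflect the base-b digits of k about the radix point."""
    x, q = Fraction(0), Fraction(1, b)
    while k > 0:
        k, digit = divmod(k, b)  # least significant digit first
        x += digit * q           # digit a_j contributes a_j / b^(j+1)
        q /= b
    return x

def halton_point(k: int, bases: list[int]) -> list[Fraction]:
    """k-th Halton point: one radical inverse per (prime) base."""
    return [radical_inverse(k, b) for b in bases]
```

With bases 2 and 3, the point with index 5 has coordinates 5/8 and 7/9.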

As noted by Halton [165], it is also easy to compute $\psi_b(k+1)$ recursively
from $\psi_b(k)$ without explicitly calculating the base-b expansion of either k or
k + 1. Halton and Smith [166] provide a numerically stable implementation of
this idea, also used in Fox [126]. We will, however, need the base-b expansions
for other methods.


5.2.2 Faure 


We noted in the previous section that the uniformity of Halton sequences
degrades in higher dimensions because higher-dimensional coordinates are
constructed from Van der Corput sequences with large bases. In particular, the
dth coordinate uses a base at least as large as the dth prime, and this grows
faster than linearly in d (the dth prime is asymptotically d log d). Faure [116]
developed a different extension of Van der Corput sequences to multiple
dimensions in which all coordinates use a common base. This base must be at
least as large as the dimension itself, but can be much smaller than the largest
base used for a Halton sequence of equal dimension.

In a d-dimensional Faure sequence, the coordinates are constructed by
permuting segments of a single Van der Corput sequence. For the base b we
choose the smallest prime number greater than or equal to d. As in (5.7), let
$a_\ell(k)$ denote the coefficients in the base-b expansion of k, so that

\[ k = \sum_{\ell=0}^{\infty} a_\ell(k)\, b^{\ell}. \tag{5.15} \]

The ith coordinate, i = 1, ..., d, of the kth point in the Faure sequence is
given by

\[ \sum_{j=1}^{\infty} \frac{y_j^{(i)}(k)}{b^{j}}, \tag{5.16} \]

where

\[ y_j^{(i)}(k) = \sum_{\ell=0}^{\infty} \binom{\ell}{j-1} (i-1)^{\ell-j+1} a_\ell(k) \bmod b, \tag{5.17} \]

with

\[ \binom{m}{n} = \begin{cases} m!/((m-n)!\,n!), & m \ge n, \\ 0, & \text{otherwise}, \end{cases} \]

and 0! = 1.


Each of the sums in (5.15)-(5.17) has only a finite number of nonzero
terms. Suppose the base-b expansion of k in (5.15) has exactly r terms,
meaning that $a_{r-1}(k) \ne 0$ and $a_\ell(k) = 0$ for all $\ell \ge r$. Then the summands in
(5.17) vanish for $\ell \ge r$. If $j \ge r + 1$, then the summands for $\ell = 0, \ldots, r-1$
also vanish, so $y_j^{(i)}(k) = 0$ if $j \ge r + 1$, which implies that (5.16) has at most
r nonzero terms. The construction may thus be viewed as the result of the
matrix-vector calculation

\[ \begin{pmatrix} y_1^{(i)}(k) \\ y_2^{(i)}(k) \\ \vdots \\ y_r^{(i)}(k) \end{pmatrix} = C^{(i-1)} \begin{pmatrix} a_0(k) \\ a_1(k) \\ \vdots \\ a_{r-1}(k) \end{pmatrix} \bmod b, \tag{5.18} \]

where $C^{(i)}$ is the r × r matrix with entries

\[ C^{(i)}(m,n) = \binom{n-1}{m-1}\, i^{\,n-m} \tag{5.19} \]

for n ≥ m and zero otherwise. With the convention that $0^0 = 1$ and $0^j = 0$ for
j > 0, this makes $C^{(0)}$ the identity matrix. (Note that in (5.18) the coefficients
$a_i(k)$ are ordered from least significant to most significant.)

These generator matrices have the following cyclic properties: for i ≥ 1,

\[ C^{(i)} = C^{(1)} C^{(i-1)}, \]

and for i ≥ 2, $C^{(i)} \bmod i$ is the identity matrix.

To see the effect of the transformations (5.16)-(5.17), consider the
integers from 0 to $b^r - 1$. These are the integers whose base-b expansions
have r or fewer terms. As k varies over this range, the vector
$a(k) = (a_0(k), a_1(k), \ldots, a_{r-1}(k))^\top$ varies over all $b^r$ vectors with elements in the
set {0, 1, ..., b − 1}. The matrices $C^{(i)}$ have the property that the product
$C^{(i)} a(k)$ (taken mod b) ranges over exactly the same set. In other words,

\[ C^{(i)} a(k) \bmod b, \quad 0 \le k < b^r, \]

is a permutation of a(k), $0 \le k < b^r$. In fact, the same is true if we restrict k
to any range of the form $j b^r \le k < (j+1) b^r$, with 0 ≤ j ≤ b − 1. It follows
that the ith coordinates $x_k^{(i)}$ of the points $x_k$, for any such set of k, form a
permutation of the segment $\psi_b(k)$, $j b^r \le k < (j+1) b^r$, of the Van der
Corput sequence.
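As a simple illustration, consider the case r = 2 and b = 3. The generator
matrices are

\[ C^{(1)} = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}, \qquad C^{(2)} = C^{(1)} C^{(1)} = \begin{pmatrix} 1 & 2 \\ 0 & 1 \end{pmatrix}. \]

For k = 0, 1, ..., 8, the vectors a(k) are

\[ \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \begin{pmatrix} 2 \\ 0 \end{pmatrix}, \begin{pmatrix} 0 \\ 1 \end{pmatrix}, \begin{pmatrix} 1 \\ 1 \end{pmatrix}, \begin{pmatrix} 2 \\ 1 \end{pmatrix}, \begin{pmatrix} 0 \\ 2 \end{pmatrix}, \begin{pmatrix} 1 \\ 2 \end{pmatrix}, \begin{pmatrix} 2 \\ 2 \end{pmatrix}. \]

The vectors $C^{(1)} a(k)$ (mod b) are

\[ \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \begin{pmatrix} 2 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 \\ 1 \end{pmatrix}, \begin{pmatrix} 2 \\ 1 \end{pmatrix}, \begin{pmatrix} 0 \\ 1 \end{pmatrix}, \begin{pmatrix} 2 \\ 2 \end{pmatrix}, \begin{pmatrix} 0 \\ 2 \end{pmatrix}, \begin{pmatrix} 1 \\ 2 \end{pmatrix}. \]

And the vectors $C^{(2)} a(k)$ (mod b) are

\[ \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \begin{pmatrix} 2 \\ 0 \end{pmatrix}, \begin{pmatrix} 2 \\ 1 \end{pmatrix}, \begin{pmatrix} 0 \\ 1 \end{pmatrix}, \begin{pmatrix} 1 \\ 1 \end{pmatrix}, \begin{pmatrix} 1 \\ 2 \end{pmatrix}, \begin{pmatrix} 2 \\ 2 \end{pmatrix}, \begin{pmatrix} 0 \\ 2 \end{pmatrix}. \]

Now we apply (5.16) to convert each of these sets of vectors into fractions, by
premultiplying each vector by (1/3, 1/9). Arranging these fractions into three
rows, we get

    0   1/3   2/3   1/9   4/9   7/9   2/9   5/9   8/9
    0   1/3   2/3   4/9   7/9   1/9   8/9   2/9   5/9
    0   1/3   2/3   7/9   1/9   4/9   5/9   8/9   2/9

The first row gives the first nine elements of the base-3 Van der Corput
sequence and the next two rows permute these elements. Finally, by taking each
column in this array as a point in the three-dimensional unit hypercube, we
get the first nine points of the three-dimensional Faure sequence.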
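
The worked example can be reproduced with a direct, unoptimized evaluation of (5.16)-(5.17). The following Python sketch uses a hypothetical function name, faure_point, and exact fractions so that the output can be compared entry by entry with the three rows above; it is meant as an illustration of the definition, not as an efficient generator.

```python
from fractions import Fraction
from math import comb

def faure_point(k: int, d: int, b: int) -> list[Fraction]:
    """k-th point of the d-dimensional Faure sequence in prime base b >= d."""
    a = []                       # a[l] = coefficient a_l(k), least significant first
    while k > 0:
        k, digit = divmod(k, b)
        a.append(digit)
    r = len(a)
    point = []
    for i in range(1, d + 1):
        x = Fraction(0)
        for j in range(1, r + 1):
            # (5.17): y_j = sum over l of binom(l, j-1) (i-1)^(l-j+1) a_l, mod b
            y = sum(comb(l, j - 1) * (i - 1) ** (l - j + 1) * a[l]
                    for l in range(j - 1, r)) % b
            x += Fraction(y, b ** j)          # (5.16)
        point.append(x)
    return point
```

Calling faure_point(k, 3, 3) for k = 0, ..., 8 returns the nine columns of the array; for instance, k = 3 gives the point (1/9, 4/9, 7/9).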

Faure [116] showed that the discrepancy of the d-dimensional sequence
constructed through (5.16)-(5.17) satisfies

\[ D^*(x_1, \ldots, x_n) \le F_d \frac{(\log n)^d}{n} + O\!\left( \frac{(\log n)^{d-1}}{n} \right), \]

with $F_d$ depending on d but not n; thus, Faure sequences are indeed
low-discrepancy sequences. In fact, $F_d \to 0$ quickly as $d \to \infty$, in marked contrast
to the increase in the constant for Halton sequences.

In the terminology of Section 5.1.4, Faure sequences are (0, d)-sequences 
and thus achieve the best possible value of the uniformity parameter t. The 
example in the left panel of Figure 5.3 is the projection onto dimensions two 
and three of the first 81 points of the three-dimensional Faure sequence with 
base 3. 

As a consequence of the definition of a (t,d)-sequence, any set of Faure
points of the form $\{x_k : j b^m \le k < (j+1) b^m\}$ with 0 ≤ j ≤ b − 1 and m ≥ 1 is
a (0, m, d)-net, which we call a Faure net. (Recall that $x_0 = 0$.) The discussion
following (5.19) may then be summarized as stating that, over a Faure net,
all one-dimensional projections are the same, as each is a permutation of a
segment of the Van der Corput sequence.

The cyclic properties of the generator matrices $C^{(i)}$ have implications for
higher dimensional projections as well. The projection of the points in a Faure
net onto the ith and jth coordinates depends only on the distance j − i, modulo
the base b. Thus, if the dimension equals the base, the b(b − 1) two-dimensional
projections comprise at most b − 1 distinct sets in $[0,1)^2$. If we identify sets
in $[0,1)^2$ that result from interchanging coordinates, then there are at most
(b − 1)/2 distinct projections. Similar conclusions hold for projections onto
more than two coordinates.

This phenomenon is illustrated in Figure 5.9, which is based on a Faure
net in base 31, the 961 points constructed from $k = 5(31)^2, \ldots, 6(31)^2 - 1$.
The projection onto coordinates 1 and 2 is identical to the projection onto
coordinates 19 and 20. The projection onto coordinates 1 and 31 would look the
same as the other two if plotted with the axes interchanged because modulo
31, 1 is the successor of 31.


Implementation 


The construction of Faure points builds on the construction of Van der Corput
sequences in Section 5.2.1. To generate the d coordinates of the Faure point $x_k$
in base b, we record the base-b expansion of k in a vector, multiply the vector
(mod b) by a generator matrix, and then convert the resulting vector to a point
in the unit interval. We give a high-level description of an implementation,
omitting many programming details. Fox [126] provides FORTRAN code to
generate Faure points.

Fig. 5.9. Projections of 961 Faure points in 31 dimensions using base 31. From
left to right, the figures show coordinates 1 and 2, 19 and 20, and 1 and 31. The
first two are identical; the third would look the same as the others if the axes were
interchanged.

A key step is the construction of the generator matrices (5.19). Because
these matrices have the property that $C^{(i)}$ is the ith power of $C^{(1)}$, it is
possible to construct just $C^{(1)}$ and then recursively evaluate products of the
form $C^{(i)} a$ as $C^{(1)} C^{(i-1)} a$. However, to allow extensions to more general
matrices, we do not take advantage of this in our implementation.

As noted by Fox [126] (for the case i = 1), in calculating the matrix
entries in (5.19), evaluation of binomial coefficients can be avoided through
the recursion

\[ \binom{n+1}{k+1} = \binom{n}{k+1} + \binom{n}{k}, \]

n ≥ k ≥ 0. Figure 5.10 displays a function FAUREMAT that uses this
recursion to construct $C^{(i)}$.


FAUREMAT(r, i)
    C(1,1) ← 1
    for m = 2, ..., r
        C(m,m) ← 1
        C(1,m) ← i * C(1,m − 1)
    for n = 3, ..., r
        for m = 2, ..., n − 1
            C(m,n) ← C(m − 1,n − 1) + i * C(m,n − 1)
    return C


Fig. 5.10. Function FAUREMAT(r, i) returns r × r generator matrix $C^{(i)}$.
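
A Python counterpart of this recursion might look as follows. The name faure_matrix and the zero-based indexing are choices made for this sketch; entry C[m][n] (zero-based) corresponds to C(m + 1, n + 1) in (5.19).

```python
def faure_matrix(r: int, i: int) -> list[list[int]]:
    """r x r generator matrix C^(i), built with the Pascal-type recursion."""
    C = [[0] * r for _ in range(r)]
    for n in range(r):
        C[n][n] = 1                           # unit diagonal
        if n > 0:
            C[0][n] = i * C[0][n - 1]         # first row holds the powers i^n
    for n in range(2, r):                     # zero-based column index
        for m in range(1, n):
            # binom(n, m) i^(n-m) = binom(n-1, m-1) i^(n-m) + i * binom(n-1, m) i^(n-m-1)
            C[m][n] = C[m - 1][n - 1] + i * C[m][n - 1]
    return C
```

With r = 2 this reproduces the matrices of the earlier illustration, [[1, 1], [0, 1]] for i = 1 and [[1, 2], [0, 1]] for i = 2, and i = 0 yields the identity.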


The function FAUREPTS in Figure 5.11 uses FAUREMAT to generate
Faure points. The function takes as inputs the starting index n0, the total
number of points to generate npts, the dimension d, and the base b. The
starting index n0 must be greater than or equal to 1 and the base b must be a
prime number at least as large as d. One could easily modify the function to
include an array of prime numbers to save the user from having to specify the
base. Calling FAUREPTS with n0 = 1 starts the sequence at the first nonzero
point. Fox [126] recommends starting at $n_0 = b^4 - 1$ to improve uniformity.
The advantage of generating npts points in a single call to the function lies
in constructing the generator matrices just once. In FAUREPTS, rmax is the
number of places in the base-b representation of n0 + npts − 1, so the largest
generator matrices needed are rmax × rmax. We construct these and then use
the submatrices needed to convert smaller integers. The variable r keeps track
of the length of the expansion a of the current integer k, so we use the first r
rows and columns of the full generator matrices to produce the required r × r
generator matrices. The variable r increases by one each time k reaches qnext,
the next power of b.


FAUREPTS(n0, npts, d, b)
    nmax ← n0 + npts − 1, rmax ← 1 + ⌊log(nmax)/log(b)⌋
    P ← 0 [npts × d], y(1, ..., rmax) ← 0
    r ← 1 + ⌊log(max(1, n0 − 1))/log(b)⌋, qnext ← b^r
    a ← B-ARY(n0 − 1, b)
    for i = 1, ..., d − 1
        C^(i) ← FAUREMAT(rmax, i)
    bpwrs ← (1/b, 1/b^2, ..., 1/b^rmax)
    for k = n0, ..., nmax
        a ← NEXTB-ARY(a, b)
        if (k = qnext)
            r ← r + 1
            qnext ← b * qnext
        for j = 1, ..., r
            P(k − n0 + 1, 1) ← P(k − n0 + 1, 1) + bpwrs(j) * a(r − j + 1)
        for i = 2, ..., d
            for m = 1, ..., r
                for n = m, ..., r
                    y(m) ← y(m) + C^(i−1)(m,n) * a(r − n + 1)
                y(m) ← y(m) mod b
                P(k − n0 + 1, i) ← P(k − n0 + 1, i) + bpwrs(m) * y(m)
                y(m) ← 0
    return P


Fig. 5.11. Function FAUREPTS(n0, npts, d, b) returns npts × d array whose rows are
coordinates of d-dimensional Faure points in base b, starting from the n0th nonzero
point.
The loop over m and n near the bottom of FAUREPTS executes the
matrix-vector product in (5.18). The algorithm traverses the elements of the 
vector a from the highest index to the lowest because the variable a in the 
algorithm is flipped relative to the vector of coefficients in (5.18). This re- 
sults from a conflict in two notational conventions: the functions B-ARY and 
NEXTB-ARY follow the usual convention of ordering digits from most signif- 
icant to least, whereas in (5.18) and the surrounding discussion we prefer to 
start with the least significant digit. 

In FAUREPTS, we apply the mod-b operation only after multiplying each
vector of base-b coefficients by a generator matrix. We could also take the
remainder mod b after each multiplication of an element of $C^{(i)}$ by an element
of a, or in setting up the matrices $C^{(i)}$; indeed, we could easily modify
FAUREMAT to return $C^{(i)}$ mod b by including b as an argument of that function.
Taking remainders at intermediate steps can eliminate problems from
overflow, but requires additional calculation.
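
The matrix-based construction, with the generator matrices reduced mod b as they are built, can be sketched in Python as below. This is an illustrative variant rather than a transcription of FAUREPTS: the name faure_points is hypothetical, the digits of each k are recomputed instead of being updated with NEXTB-ARY, digits are stored least significant first (so no index flipping is needed), and k = 0 is allowed as a starting index.

```python
def faure_points(n0: int, npts: int, d: int, b: int) -> list[list[float]]:
    """Rows are Faure points x_k, k = n0, ..., n0 + npts - 1, prime base b >= d."""
    def digits(k: int) -> list[int]:     # least significant digit first
        out = []
        while k > 0:
            k, rem = divmod(k, b)
            out.append(rem)
        return out

    rmax = len(digits(n0 + npts - 1))
    # C[i][m][n] = binom(n, m) * i^(n-m) mod b (zero-based), i.e. C^(i) mod b
    C = []
    for i in range(d):
        M = [[0] * rmax for _ in range(rmax)]
        for n in range(rmax):
            M[n][n] = 1
            if n > 0:
                M[0][n] = (i * M[0][n - 1]) % b
            for m in range(1, n):
                M[m][n] = (M[m - 1][n - 1] + i * M[m][n - 1]) % b
        C.append(M)

    points = []
    for k in range(n0, n0 + npts):
        a = digits(k)
        r = len(a)
        row = []
        for i in range(d):               # coordinate i + 1 uses C^(i)
            x = 0.0
            for m in range(r):
                y = sum(C[i][m][n] * a[n] for n in range(m, r)) % b
                x += y / b ** (m + 1)
            row.append(x)
        points.append(row)
    return points
```

Because the matrix entries are already reduced, every intermediate product stays below b squared, which is the overflow protection described above, at the cost of the extra remainder operations.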

An alternative construction of Faure points makes it possible to replace the 
matrix-vector product in (5.18) (the loop over m and n in FAUREPTS) with a 
single vector addition. This alternative construction produces permutations of 
Faure points, rather than the Faure points themselves. It relies on the notion 
of a Gray code in base b; we return to it in the next section after a more 
general discussion of Gray codes. 


5.2.3 Sobol’ 


Sobol’ [335] gave the first construction of what is now known as a
(t,d)-sequence (he used the name $LP_\tau$-sequence). The methods of Halton and
Hammersley have low discrepancy, but they are not (t,d)-sequences or
(t, m, d)-nets. Sobol’s construction can be succinctly contrasted with Faure’s as
follows: Whereas Faure points are (0, d)-sequences in a base at least as large
as d, Sobol’s points are (t,d)-sequences in base 2 for all d, with values of t
that depend on d. Faure points therefore achieve the best value of the
uniformity parameter t, but Sobol’ points have the advantage of a much smaller
base. Working in base 2 also lends itself to computational advantages through
bit-level operations.

Like the methods of Halton, Hammersley, and Faure, Sobol’ points start 
from the Van der Corput sequence, but now exclusively in base 2. The various 
coordinates of a d-dimensional Sobol’ sequence result from permutations of 
segments of the Van der Corput sequence. As in Section 5.2.2, these permu- 
tations result from multiplying (binary) expansions of consecutive integers by 
a set of generator matrices, one for each dimension. The key difference lies in 
the construction of these generator matrices. 

All coordinates of a Sobol’ sequence follow the same construction, but each
with its own generator. We may therefore begin by discussing the construction
of a single coordinate based on a generator matrix V. The elements of V are
equal to 0 or 1. Its columns are the binary expansions of a set of direction
numbers $v_1, \ldots, v_r$. Here, r could be arbitrarily large; in constructing the
kth point in the sequence, think of r as the number of terms in the binary
expansion of k. The matrix V will be upper triangular, so regardless of the
number of rows in the full matrix, it suffices to consider the square matrix
consisting of the first r rows and columns.
Let $a(k) = (a_0(k), \ldots, a_{r-1}(k))^\top$ denote the vector of coefficients of the
binary representation of k, so that

\[ k = a_0(k) + 2 a_1(k) + \cdots + 2^{r-1} a_{r-1}(k). \]

Let

\[ \begin{pmatrix} y_1(k) \\ y_2(k) \\ \vdots \\ y_r(k) \end{pmatrix} = V \begin{pmatrix} a_0(k) \\ a_1(k) \\ \vdots \\ a_{r-1}(k) \end{pmatrix} \bmod 2; \tag{5.20} \]

then $y_1(k), \ldots, y_r(k)$ are coefficients of the binary expansion of the kth point
in the sequence; more explicitly,

\[ x_k = \frac{y_1(k)}{2} + \frac{y_2(k)}{4} + \cdots + \frac{y_r(k)}{2^r}. \]

If V is the identity matrix, this produces the Van der Corput sequence in base
2.

The operation in (5.20) can be represented as

\[ a_0(k) v_1 \oplus a_1(k) v_2 \oplus \cdots \oplus a_{r-1}(k) v_r, \tag{5.21} \]

where the $v_j$ are the columns of (the r × r submatrix of) V and ⊕ denotes
binary addition,

\[ 0 \oplus 0 = 0, \quad 0 \oplus 1 = 1 \oplus 0 = 1, \quad 1 \oplus 1 = 0. \]

This formulation is useful in computer implementation. If we reinterpret $v_j$
as the computer representation of a number (i.e., as a computer word of bits
rather than as an array) and implement ⊕ through a bitwise XOR operation,
then (5.21) produces the computer representation of $x_k$.

We turn now to the heart of Sobol’s method, which is the specification of
the generator matrices or, equivalently, of the direction numbers $v_j$. We use
the same symbol $v_j$ to denote the number itself (a binary fraction) as we use
to denote the vector encoding its binary representation. For a d-dimensional
Sobol’ sequence we need d sets of direction numbers, one for each coordinate;
for simplicity, we continue to focus on a single coordinate.

Sobol’s method for choosing a set of direction numbers starts by selecting
a primitive polynomial over binary arithmetic. This is a polynomial

\[ x^q + c_1 x^{q-1} + \cdots + c_{q-1} x + 1, \tag{5.22} \]

with coefficients $c_i$ in {0,1}, satisfying the following two properties (with
respect to binary arithmetic):

◦ it is irreducible (i.e., it cannot be factored);
◦ the smallest power p for which the polynomial divides $x^p + 1$ is $p = 2^q - 1$.


Irreducibility implies that the constant term 1 must indeed be present as
implied by (5.22). The largest power q with a nonzero coefficient is the degree
of the polynomial. The polynomials

\[ x + 1, \quad x^2 + x + 1, \quad x^3 + x + 1, \quad x^3 + x^2 + 1 \]

are all the primitive polynomials of degree one, two, or three.

Table 5.2 lists 53 primitive polynomials, including all those of degree 8 or
less. (A list of 360 primitive polynomials is included in the implementation
of Lemieux, Cieslak, and Luttmer [231].) Each polynomial in the table is
encoded as the integer defined by interpreting the coefficients of the polynomial
as bits. For example, the integer 37 in binary is 100101, which encodes the
polynomial $x^5 + x^2 + 1$. Table 5.2 includes a polynomial of degree 0; this is
a convenient convention that makes the construction of the first coordinate
of a multidimensional Sobol’ sequence consistent with the construction of the
other coordinates.


Degree | Primitive Polynomials
-------+------------------------------------------------
   0   | 1
   1   | 3 (x + 1)
   2   | 7 (x^2 + x + 1)
   3   | 11 (x^3 + x + 1), 13 (x^3 + x^2 + 1)
   4   | 19, 25
   5   | 37, 59, 47, 61, 55, 41
   6   | 67, 97, 91, 109, 103, 115
   7   | 131, 193, 137, 145, 143, 241, 157, 185, 167,
       | 229, 171, 213, 191, 253, 203, 211, 239, 247
   8   | 285, 369, 299, 425, 301, 361, 333, 357, 351,
       | 501, 355, 397, 391, 451, 463, 487


Table 5.2. Primitive polynomials of degree 8 or less. Each number in the right
column, when represented in binary, gives the coefficients of a primitive polynomial.
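
For the small degrees in Table 5.2, primitivity can be checked by brute force directly from the second defining property. The Python sketch below uses a hypothetical helper, is_primitive, that takes the integer encoding used in the table and finds the smallest p with x^p = 1 modulo the polynomial (equivalently, the smallest p for which the polynomial divides x^p + 1). It relies on the fact that if this order equals 2^q − 1, the polynomial is necessarily irreducible; the degree-0 convention of the table is deliberately excluded.

```python
def is_primitive(code: int) -> bool:
    """True if the bits of code encode a primitive polynomial over GF(2), degree >= 1."""
    q = code.bit_length() - 1            # degree of the polynomial
    if q < 1 or code & 1 == 0:           # constant term must be present
        return False
    target = (1 << q) - 1
    p = 1                                # the polynomial x^0
    for n in range(1, target + 1):
        p <<= 1                          # multiply by x
        if (p >> q) & 1:
            p ^= code                    # reduce modulo the polynomial
        if p == 1:                       # x^n = 1, so the polynomial divides x^n + 1
            return n == target
    return False
```

For instance, 11 and 13 pass the test, whereas 9 (the polynomial x^3 + 1) and 31 (an irreducible polynomial of degree 4 for which p = 5) do not. The cost grows like 2^q per polynomial, so this approach is only sensible for low degrees.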


The polynomial (5.22) defines a recurrence relation

\[ m_j = 2 c_1 m_{j-1} \oplus 2^2 c_2 m_{j-2} \oplus \cdots \oplus 2^{q-1} c_{q-1} m_{j-q+1} \oplus 2^q m_{j-q} \oplus m_{j-q}. \tag{5.23} \]

The $m_j$ are integers and ⊕ may again be interpreted either as binary
addition of binary vectors (by identifying $m_j$ with its vector of binary coefficients)
or as bit-wise XOR applied directly to the computer representations of the
operands. By convention, the recurrence relation defined by the degree-0
polynomial is $m_j = 1$. From the $m_j$, the direction numbers are defined by setting

\[ v_j = m_j / 2^j. \]

For this to fully define the direction numbers we need to specify initial
values $m_1, \ldots, m_q$ for (5.23). A minimal requirement is that each initializing
$m_j$ be an odd integer less than $2^j$; all subsequent $m_j$ defined by (5.23) will
then share this property and each $v_j$ will lie strictly between 0 and 1. More
can be said about the proper initialization of (5.23); we return to this point
after considering an example.

Consider the primitive polynomial

\[ x^3 + x^2 + 1 \]

with degree q = 3. The recurrence (5.23) becomes

\[ m_j = 2 m_{j-1} \oplus 8 m_{j-3} \oplus m_{j-3}, \]

and suppose we initialize it with $m_1 = 1$, $m_2 = 3$, $m_3 = 3$. The next two
elements in the sequence are as follows:

\begin{align*}
m_4 &= (2 \cdot 3) \oplus (8 \cdot 1) \oplus 1 \\
    &= 0110 \oplus 1000 \oplus 0001 \\
    &= 1111 \\
    &= 15, \\
m_5 &= (2 \cdot 15) \oplus (8 \cdot 3) \oplus 3 \\
    &= 11110 \oplus 11000 \oplus 00011 \\
    &= 00101 \\
    &= 5.
\end{align*}

From these five values of $m_j$, we can calculate the corresponding values of
$v_j$ by dividing by $2^j$. But dividing by $2^j$ is equivalent to shifting the “binary
point” to the left j places in the representation of $m_j$. Hence, the first five
direction numbers are

\[ v_1 = 0.1, \quad v_2 = 0.11, \quad v_3 = 0.011, \quad v_4 = 0.1111, \quad v_5 = 0.00101, \]

and the corresponding generator matrix is

\[ V = \begin{pmatrix} 1 & 1 & 0 & 1 & 0 \\ 0 & 1 & 1 & 1 & 0 \\ 0 & 0 & 1 & 1 & 1 \\ 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{pmatrix}. \tag{5.24} \]

Observe that taking $m_j = 1$, j = 1, 2, ... (i.e., using the degree-0 polynomial)
produces the identity matrix.
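
The recurrence (5.23) and the passage from the $m_j$ to the generator matrix can be sketched in Python as follows. The function names are hypothetical; the polynomial is passed in the integer encoding of Table 5.2, and the rows of V are returned as bit strings purely for easy comparison with (5.24).

```python
def sobol_m_values(poly: int, m_init: list[int], count: int) -> list[int]:
    """Extend initial values m_1, ..., m_q to m_1, ..., m_count using (5.23)."""
    q = poly.bit_length() - 1              # degree of the polynomial
    if q == 0:
        return [1] * count                 # degree-0 convention: m_j = 1
    m = list(m_init[:q])
    while len(m) < count:
        j = len(m)                         # zero-based index of the new value
        new = m[j - q] ^ (m[j - q] << q)   # m_{j-q} XOR 2^q m_{j-q}
        for i in range(1, q):
            if (poly >> (q - i)) & 1:      # coefficient c_i of x^(q-i)
                new ^= m[j - i] << i       # 2^i c_i m_{j-i}
        m.append(new)
    return m[:count]

def generator_matrix(m: list[int]) -> list[str]:
    """Rows of the r x r matrix V whose j-th column holds the bits of v_j = m_j / 2^j."""
    r = len(m)
    cols = [format(mj << (r - j), f"0{r}b") for j, mj in enumerate(m, start=1)]
    return ["".join(col[row] for col in cols) for row in range(r)]
```

With poly = 13 (binary 1101, the polynomial of the example) and initial values 1, 3, 3, the first function returns 1, 3, 3, 15, 5, and the second returns the rows 11010, 01110, 00111, 00010, 00001 of (5.24).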

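Finally, we illustrate the calculation of the sequence $x_1, x_2, \ldots$. For each
k, we take the vector a(k) of binary coefficients of k and premultiply it (mod
2) by the matrix V. The resulting vector gives the coefficients of a binary
fraction. The first three vectors are

\[ V \begin{pmatrix} 1 \\ 0 \\ 0 \\ 0 \\ 0 \end{pmatrix} = \begin{pmatrix} 1 \\ 0 \\ 0 \\ 0 \\ 0 \end{pmatrix}, \qquad V \begin{pmatrix} 0 \\ 1 \\ 0 \\ 0 \\ 0 \end{pmatrix} = \begin{pmatrix} 1 \\ 1 \\ 0 \\ 0 \\ 0 \end{pmatrix}, \qquad V \begin{pmatrix} 1 \\ 1 \\ 0 \\ 0 \\ 0 \end{pmatrix} = \begin{pmatrix} 0 \\ 1 \\ 0 \\ 0 \\ 0 \end{pmatrix}, \]

which produce the points 1/2, 3/4, and 1/4. For k = 29, 30, 31 (the last three
points that can be generated with a 5 × 5 matrix), we have

\[ V \begin{pmatrix} 1 \\ 0 \\ 1 \\ 1 \\ 1 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \\ 1 \\ 1 \\ 1 \end{pmatrix}, \qquad V \begin{pmatrix} 0 \\ 1 \\ 1 \\ 1 \\ 1 \end{pmatrix} = \begin{pmatrix} 0 \\ 1 \\ 1 \\ 1 \\ 1 \end{pmatrix}, \qquad V \begin{pmatrix} 1 \\ 1 \\ 1 \\ 1 \\ 1 \end{pmatrix} = \begin{pmatrix} 1 \\ 1 \\ 1 \\ 1 \\ 1 \end{pmatrix}, \]

which produce 7/32, 15/32, and 31/32.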
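
The same points can be obtained from the XOR formulation (5.21) by storing each direction number as an integer. In this Python sketch the function name sobol_point is hypothetical, each $v_j$ is represented by the numerator of $v_j$ over $2^r$, and the list m holds the values $m_1, \ldots, m_r$ of the example.

```python
from fractions import Fraction

def sobol_point(k: int, m: list[int]) -> Fraction:
    """x_k from (5.21): XOR the direction numbers selected by the bits of k."""
    r = len(m)
    if k >= 1 << r:
        raise ValueError("k needs more direction numbers than were supplied")
    # v_j = m_j / 2^j, stored as the integer numerator of v_j over 2^r
    v = [mj << (r - j) for j, mj in enumerate(m, start=1)]
    x = 0
    for j in range(r):
        if (k >> j) & 1:         # coefficient a_j(k) multiplies v_{j+1}
            x ^= v[j]
    return Fraction(x, 1 << r)
```

With m = [1, 3, 3, 15, 5], the calls for k = 1, 2, 3 return 1/2, 3/4, 1/4 and those for k = 29, 30, 31 return 7/32, 15/32, 31/32, in agreement with the matrix calculation.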

Much as in the case of Faure points, this procedure produces a permutation
of the segment $\psi_2(k)$, $2^{r-1} \le k < 2^r$, of the Van der Corput sequence when
V is r × r. This crucial property relies on the fact that V was constructed
from a primitive polynomial.


Gray Code Construction 


Antanov and Saleev [18] point out that Sobol’s method simplifies if the usual
binary representation a(k) is replaced with a Gray code representation. In a
Gray code, k and k + 1 have all but one coefficient in common, and this makes
it possible to construct the values $x_k$ recursively.

One way to define a Gray code is to take the bitwise binary sum of the
usual binary representation of k, $a_{r-1}(k) \cdots a_1(k) a_0(k)$, with the shifted string
$0\, a_{r-1}(k) \cdots a_1(k)$; in other words, take the ⊕-sum of the binary
representations of k and ⌊k/2⌋. This encodes the numbers 1 to 7 as follows:


              1    2    3    4    5    6    7
Binary       001  010  011  100  101  110  111
Gray code    001  011  010  110  111  101  100


For example, the Gray code for 3 is calculated as 011 ⊕ 001.

Exactly one bit in the Gray code changes when k is incremented to k +1. 
The position of that bit is the position of the rightmost zero in the ordinary 
binary representation of k, taking this to mean an initial zero if the binary 
representation has only ones. For example, since the last bit of the binary 
representation of 4 is zero, the Gray codes for 4 and 5 differ in the last bit. 
Because the binary representation of 7 is displayed as 111, the Gray code for 
8 differs from the Gray code for 7 through the insertion of an initial bit, which 


would produce 1100. 
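
Both facts, the definition through a shift and the rule for which bit changes, are easy to express with bit operations. The helper names in this Python sketch are hypothetical.

```python
def gray_code(k: int) -> int:
    """Gray code of k: bitwise XOR of k with floor(k / 2)."""
    return k ^ (k >> 1)

def rightmost_zero(k: int) -> int:
    """Position (counted from the right, starting at 1) of the rightmost zero bit of k."""
    pos = 1
    while k & 1:
        k >>= 1
        pos += 1
    return pos
```

The bit in which gray_code(k) and gray_code(k + 1) differ is the one in position rightmost_zero(k); for k = 7 this is position 4, corresponding to the inserted leading bit in 1100.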
The binary strings formed by the Gray code representations of the integers
$0, 1, \ldots, 2^r - 1$ are a permutation of the sequence of strings formed by the usual
binary representations of the same integers, for any r. If in the definition of the
radical inverse function $\psi_2$ in (5.8) we replaced the usual binary coefficients
with Gray code coefficients, the first $2^r - 1$ values would be a permutation
of the corresponding elements of the Van der Corput sequence. Hence, the
two sequences have the same asymptotic discrepancy. Antanov and Saleev
[18] show similarly that using a Gray code with Sobol’s construction does not
affect the asymptotic discrepancy of the resulting sequence.

Suppose, then, that in (5.21) we replace the binary coefficients $a_j(k)$ with
Gray code coefficients $g_j(k)$ and redefine $x_k$ to be

\[ x_k = g_0(k) v_1 \oplus g_1(k) v_2 \oplus \cdots \oplus g_{r-1}(k) v_r. \]

Suppose that the Gray codes of k and k + 1 differ in the ℓth bit (counting from
the right, so that the coefficient that changes is $g_{\ell-1}$); then

\begin{align*}
x_{k+1} &= g_0(k+1) v_1 \oplus g_1(k+1) v_2 \oplus \cdots \oplus g_{r-1}(k+1) v_r \\
        &= g_0(k) v_1 \oplus g_1(k) v_2 \oplus \cdots \oplus (g_{\ell-1}(k) \oplus 1) v_\ell \oplus \cdots \oplus g_{r-1}(k) v_r \\
        &= x_k \oplus v_\ell. \tag{5.25}
\end{align*}


Rather than compute $x_{k+1}$ from (5.21), we may thus compute it recursively
from $x_k$ through binary addition of a single direction number. The computer
implementations of Bratley and Fox [57] and Press et al. [299] use this
formulation.

It is worth noting that if we start the Sobol’ sequence at k = 0, we never
need to calculate a Gray code. To use (5.25), we need to know only ℓ, the index
of the bit that would change if we calculated the Gray code. But, as explained
above, ℓ is completely determined by the ordinary binary expansion of k. To
start the Sobol’ sequence at an arbitrary point $x_n$, we need to calculate the
Gray code of n to initialize the recursion in (5.25).
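
A minimal Python sketch of the recursion (5.25), started at k = 0, is given below. The function name sobol_gray is hypothetical, the points are returned as integer numerators over $2^r$, and the direction numbers are those of the earlier example with $m = (1, 3, 3, 15, 5)$; none of this is taken from the implementations cited above.

```python
def sobol_gray(npts: int, m: list[int]) -> list[int]:
    """Numerators (over 2^r) of x_0, ..., x_{npts-1} via x_{k+1} = x_k XOR v_l."""
    r = len(m)
    if npts > 1 << r:
        raise ValueError("not enough direction numbers for this many points")
    v = [mj << (r - j) for j, mj in enumerate(m, start=1)]
    x, out = 0, []
    for k in range(npts):
        out.append(x)
        pos = 0
        while (k >> pos) & 1:    # zero-based position of the rightmost zero bit of k
            pos += 1
        if pos < r:
            x ^= v[pos]          # v[pos] is v_l with l = pos + 1 in the text's indexing
    return out
```

Only the ordinary binary expansion of k is inspected inside the loop; no Gray code is ever formed. With the five direction numbers of the example, the first 32 values are a permutation of 0/32, 1/32, ..., 31/32, in a different order from the one produced by (5.21).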

The simplification in (5.25) extends to the construction of Faure points
in any arbitrary (prime) base b through an observation of Tezuka [348]. We
digress briefly to explain this extension. Let $a_0(k), a_1(k), \ldots, a_r(k)$ denote the
coefficients in the base-b expansion of k. Setting

\[ \begin{pmatrix} g_0(k) \\ g_1(k) \\ \vdots \\ g_r(k) \end{pmatrix} = \begin{pmatrix} 1 & b-1 & & \\ & 1 & \ddots & \\ & & \ddots & b-1 \\ & & & 1 \end{pmatrix} \begin{pmatrix} a_0(k) \\ a_1(k) \\ \vdots \\ a_r(k) \end{pmatrix} \bmod b \]

defines a base-b Gray code, in the sense that the vectors thus calculated for
k and k + 1 differ in exactly one entry. The index of the entry that changes
is the smallest ℓ for which $a_\ell(k) \ne b - 1$ (padding the expansion of k with an
initial zero, if necessary, to ensure it has the same length as the expansion of
k + 1). Moreover, $g_\ell(k+1) = g_\ell(k) + 1$ modulo b.
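
Written out entry by entry, the matrix product says that each Gray code digit is $g_j = a_j + (b-1) a_{j+1} = a_j - a_{j+1}$ modulo b. The Python sketch below uses a hypothetical function name and takes the number of digits as an explicit argument (an assumption of this sketch) so that k and k + 1 are padded to the same length.

```python
def gray_code_base_b(k: int, b: int, length: int) -> list[int]:
    """Digits g_0, ..., g_{length-1} of the base-b Gray code of k < b**length."""
    a = []
    for _ in range(length + 1):          # one extra zero digit so a[j + 1] always exists
        k, digit = divmod(k, b)
        a.append(digit)
    return [(a[j] - a[j + 1]) % b for j in range(length)]
```

For b = 2 this reduces to the binary Gray code defined earlier, since subtraction modulo 2 is the same as ⊕.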
This simplifies the Faure construction. Instead of defining the coefficients
$y_j^{(i)}(k)$ through the matrix-vector product (5.18), we may set

\[ (y_1^{(i)}(k+1), \ldots, y_r^{(i)}(k+1)) = (y_1^{(i)}(k), \ldots, y_r^{(i)}(k)) + (C^{(i-1)}(1, \ell+1), \ldots, C^{(i-1)}(r, \ell+1)) \bmod b, \]

once ℓ has been determined from the vector a(k); the column index is ℓ + 1
because the coefficients $a_\ell(k)$ are numbered from zero whereas the columns of
$C^{(i-1)}$ are numbered from one. This makes it possible
to replace the loop over both m and n near the bottom of FAUREPTS in
Figure 5.11 with a loop over a single index. See Figure 5.13.


Choosing Initial Direction Numbers 


In initializing the recurrence (5.23), we required only that each $m_j$ be an odd
integer less than $2^j$. But suppose we initialize two different sequences
(corresponding to two different coordinates of a d-dimensional sequence) with the
same values $m_1, \ldots, m_r$. The first r columns of their generator matrices will
then also be the same. The kth value generated in a Sobol’ sequence depends
only on as many columns as there are coefficients in the binary expansion of
k; so, the first $2^r - 1$ values of the two sequences would be identical. Thus,
whereas the choice of initial values may not be significant in constructing a
one-dimensional sequence, it becomes important when d such sequences are
yoked together to make a d-dimensional sequence.

Sobol’ [336] provides some guidance for choosing initial values. He
establishes two results on uniformity properties achieved by initial values
satisfying additional conditions. A d-dimensional sequence $x_0, x_1, \ldots$ satisfies
Sobol’s Property A if for every j = 0, 1, ... exactly one of the points $x_k$,
$j 2^d \le k < (j+1) 2^d$, falls in each of the $2^d$ cubes of the form

\[ \prod_{i=1}^{d} \left[ \frac{a_i}{2}, \frac{a_i + 1}{2} \right), \quad a_i \in \{0, 1\}. \]

The sequence satisfies Sobol’s Property A' if for every j = 0, 1, ... exactly one
of the points $x_k$, $j 2^{2d} \le k < (j+1) 2^{2d}$, falls in each of the $2^{2d}$ cubes of the
form

\[ \prod_{i=1}^{d} \left[ \frac{a_i}{4}, \frac{a_i + 1}{4} \right), \quad a_i \in \{0, 1, 2, 3\}. \]

These properties bear some resemblance to the definition of a (t, d)-sequence
but do not fit that definition because they restrict attention to specific
equilateral boxes.

Let $v_j^{(i)}$ denote the jth direction number associated with the ith coordinate
of the sequence. The generator matrix of the ith sequence is then

\[ V^{(i)} = [\, v_1^{(i)} \mid v_2^{(i)} \mid \cdots \mid v_r^{(i)} \,], \]
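where we again use $v_j^{(i)}$ to denote the (column) vector of binary coefficients of
a direction number as well as the direction number itself. Sobol’ [336] shows
that Property A holds if and only if the determinant constructed from the
first d elements of the first row of each of the matrices satisfies

\[ \begin{vmatrix} v_{11}^{(1)} & v_{12}^{(1)} & \cdots & v_{1d}^{(1)} \\ v_{11}^{(2)} & v_{12}^{(2)} & \cdots & v_{1d}^{(2)} \\ \vdots & \vdots & & \vdots \\ v_{11}^{(d)} & v_{12}^{(d)} & \cdots & v_{1d}^{(d)} \end{vmatrix} \ne 0 \bmod 2, \]

where $v_{mj}^{(i)}$ denotes the entry in row m and column j of $V^{(i)}$.
Property A applies to sets of size $2^d$ and generating the first $2^d$ points involves
exactly the first d columns of the matrices. Sobol’ also shows that Property
A' holds if and only if the determinant constructed from the first 2d elements
of the first two rows of each of the matrices satisfies

\[ \begin{vmatrix} v_{11}^{(1)} & v_{12}^{(1)} & \cdots & v_{1,2d}^{(1)} \\ \vdots & \vdots & & \vdots \\ v_{11}^{(d)} & v_{12}^{(d)} & \cdots & v_{1,2d}^{(d)} \\ v_{21}^{(1)} & v_{22}^{(1)} & \cdots & v_{2,2d}^{(1)} \\ \vdots & \vdots & & \vdots \\ v_{21}^{(d)} & v_{22}^{(d)} & \cdots & v_{2,2d}^{(d)} \end{vmatrix} \ne 0 \bmod 2. \]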

To illustrate, suppose d = 3 and suppose the first three $m_j$ values for the
first three coordinates are as follows:

$$(m_1, m_2, m_3) = (1, 1, 1), \quad (m_1, m_2, m_3) = (1, 3, 5), \quad (m_1, m_2, m_3) = (1, 1, 7).$$

The first coordinate has $m_j = 1$ for all j. The second coordinate is generated
by a polynomial of degree 1, so all subsequent values are determined by $m_1$;
and because the third coordinate is generated by a polynomial of degree 2,
that sequence is determined by the choice of $m_1$ and $m_2$.

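From the first three $m_j$ values in each coordinate, we determine the matrices

$$V^{(1)} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \quad
V^{(2)} = \begin{pmatrix} 1 & 1 & 1 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \quad
V^{(3)} = \begin{pmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \\ 0 & 0 & 1 \end{pmatrix}.$$

From the first row of each of these we assemble the matrix

$$D = \begin{pmatrix} 1 & 0 & 0 \\ 1 & 1 & 1 \\ 1 & 0 & 1 \end{pmatrix}.$$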

Because this matrix has a determinant of 1, Sobol’s Property A holds.
The test matrix D can in fact be read directly from the $m_j$ values: $D_{ij} = 1$
if the $m_j$ value for the ith coordinate is greater than or equal to $2^{j-1}$, and
$D_{ij} = 0$ otherwise.
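The test just described is easy to automate. The following is a minimal sketch, not taken from any of the cited programs: the function name property_a_holds and the input layout (one tuple of the first d values of $m_j$ per coordinate) are assumptions made for illustration. It builds D from the rule $D_{ij} = 1$ if $m_j \ge 2^{j-1}$ and checks whether the determinant is odd by Gaussian elimination modulo 2.

```python
def property_a_holds(m_rows):
    """Check Property A from the first d values m_1..m_d of each of d coordinates.

    m_rows[i][j] holds m_{j+1} for coordinate i+1 (hypothetical input layout).
    """
    d = len(m_rows)
    # D_ij = 1 exactly when m_j >= 2^(j-1); with 0-based j this is m >= 2**j.
    D = [[1 if m_rows[i][j] >= 2 ** j else 0 for j in range(d)] for i in range(d)]
    # Elimination over GF(2): the determinant is nonzero mod 2 iff D has full rank.
    for col in range(d):
        pivot = next((r for r in range(col, d) if D[r][col]), None)
        if pivot is None:
            return False
        D[col], D[pivot] = D[pivot], D[col]
        for r in range(col + 1, d):
            if D[r][col]:
                D[r] = [x ^ y for x, y in zip(D[r], D[col])]
    return True


# The three-coordinate example from the text.
print(property_a_holds([(1, 1, 1), (1, 3, 5), (1, 1, 7)]))
```

Working modulo 2 avoids computing a large integer determinant; only the parity matters for the test.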

Table 5.3 displays initial values of the $m_j$ for up to 20 dimensions. Recall
that the number of initial values required equals the degree of the
corresponding primitive polynomial, and this increases with the dimension. The
values displayed in parentheses are determined by the initial values in each row.
The values displayed are from Bratley and Fox [57], who credit unpublished
work of Sobol’ and Levitan (also cited in Sobol’ [336]). These values satisfy
Property A; Property A’ holds for $d \le 6$.


 i    m1    m2    m3    m4    m5    m6     m7     m8
 1     1    (1)   (1)   (1)   (1)   (1)    (1)    (1)
 2     1    (3)   (5)  (15)  (17)  (51)   (85)  (255)
 3     1     1    (7)  (11)  (13)  (61)   (67)   (79)
 4     1     3     7    (5)   (7)  (43)   (49)  (147)
 5     1     1     5    (3)  (15)  (51)  (125)  (141)
 6     1     3     1     1    (9)  (59)   (25)   (89)
 7     1     1     3     7   (31)  (47)  (109)  (173)
 8     1     3     3     9     9   (57)   (43)   (43)
 9     1     3     7    13     3   (35)   (89)    (9)
10     1     1     5    11    27   (53)   (69)   (25)
11     1     3     5     1    15   (19)  (113)  (115)
12     1     1     7     3    29   (51)   (47)   (97)
13     1     3     7     7    21   (61)   (55)   (19)
14     1     1     1     9    23    37    (97)   (97)
15     1     3     3     5    19    33     (3)  (197)
16     1     1     3    13    11     7    (37)  (101)
17     1     1     7    13    25     5    (83)  (255)
18     1     3     5    11     7    11   (103)   (29)
19     1     1     1     3    13    39    (27)  (203)
20     1     3     1    15    17    63     13    (65)


Table 5.3. Initial values satisfying Sobol’s Property A for up to 20 dimensions.
In each row, values in parentheses are determined by the previous values in the
sequence.


Bratley and Fox [57] include initializing values from the same source for up 
to 40 dimensions. A remark in Sobol’ [336] indicates that the Sobol’-Levitan 
values should satisfy Property A for up to 51 dimensions; however, we find (as 
does Morland [270]) that this property does not consistently hold for d > 20. 
More precisely, for d ranging from 21 to 40, we find that Property A holds only 
at dimensions 23, 31, 33, 34, and 37. We have therefore limited Table 5.3 to the 
first 20 dimensions. The complete set of values used by Bratley and Fox [57] 
is in their FORTRAN program, available through the Collected Algorithms 
of the ACM.

Press et al. [299] give initializing values for up to six dimensions; their
values fail the test for Property A in dimensions three, five, and six. A further
distinction merits comment. We assume (as do Sobol’ [336] and Bratley and
Fox [57]) that the first coordinate uses $m_j = 1$; this makes the first generator
matrix the identity and thus makes the first coordinate the Van der Corput
sequence in base 2. We use the polynomial x + 1 for the second coordinate,
and so on. Press et al. [299] use x + 1 for the first coordinate. Whether or
not Property A holds for a particular set of initializing values depends on
whether the first row (with $m_j = 1$) of Table 5.3 is included. Thus, one
cannot interchange the initializing values used here with those used by Press
et al. [299] for the same primitive polynomial, even in cases where both satisfy
Property A.

The implementation of Lemieux, Cieslak, and Luttmer [231] includes
initializing values for up to 360 dimensions. These values do not necessarily
satisfy Sobol’s Property A, but they are the result of a search for good values
based on a resolution criterion used in the design of random number generators.
See [231] and references cited there.


Discrepancy 


In the terminology of Section 5.1.4, Sobol’ sequences are (t, d)-sequences in
base 2. The example in the right panel of Figure 5.3 is the projection onto
dimensions four and five of the first 128 points of a five-dimensional Sobol’
sequence.

Theorem 3.4 of Sobol’ [335] provides a simple expression for the t parameter
in a d-dimensional sequence as

$$t = q_1 + q_2 + \cdots + q_{d-1} - d + 1, \tag{5.26}$$

where $q_1 \le q_2 \le \cdots \le q_{d-1}$ are the degrees of the primitive polynomials
used to construct coordinates 2 through d. Recall that the first coordinate
is constructed from the degenerate recurrence with $m_j = 1$, which may be
considered to have degree zero. If instead we used polynomials of degrees
$q_1, \ldots, q_d$ for the d coordinates, the t value would be $q_1 + \cdots + q_d - d$.
Sobol’ [335] shows that while t grows faster than d, it does not grow faster than
$d \log d$.

Although (5.26) gives a valid value for t, it does not always give the best
possible value: a d-dimensional Sobol’ sequence may be a (t’, d)-sequence for
some t’ < t. Sobol’ [335] provides conditions under which (5.26) is indeed the
minimum valid value of t.

Because they are (t, d)-sequences, Sobol’ points are low-discrepancy
sequences; see (5.13) and the surrounding discussion. Sobol’ [335] provides more
detailed discrepancy bounds.


Implementation


Bratley and Fox [57] and Press et al. [299] provide computer programs to 
generate Sobol’ points in FORTRAN and C, respectively. Both take advantage 
of bit-level operations to increase efficiency. Timings reported in Bratley and 
Fox [57] indicate that using bit-level operations typically increases speed by 
a factor of more than ten, with greater improvements in higher dimensions. 

We give a somewhat more schematic description of an implementation, 
suppressing programming details in the interest of transparency. Our descrip- 
tion also highlights similarities between the construction of Sobol’ points and 
Faure points. 

As in the previous section, we separate the construction of generator
matrices from the generation of the points themselves. Figure 5.12 displays a
function SOBOLMAT to produce a generator matrix as described in the
discussion leading to (5.24). The function takes as input a binary vector cvec
giving the coefficients of a primitive polynomial, a vector minit of initializing
values, and a parameter r determining the size of the matrix produced. For
a polynomial of degree q, the vector cvec has the form $(1, c_1, \ldots, c_{q-1}, 1)$; see
(5.22). The vector minit must then have q elements: think of using the row
of Table 5.3 corresponding to the polynomial cvec, including only those values
in the row that do not have parentheses. The parameter r must be at least as
large as q. Building an r × r matrix requires calculating $m_{q+1}, \ldots, m_r$ from
the initial values $m_1, \ldots, m_q$ in minit. These are ultimately stored in mvec
which (to be consistent with Table 5.3) orders $m_1, \ldots, m_r$ from left to right.
In calling SOBOLMAT, the value of r is determined by the number of points
to be generated: generating the point $x_{2^k}$ requires r = k + 1.
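For readers who prefer running code to pseudocode, the following is a minimal Python sketch of the same construction. It is an illustration rather than a transcription of Figure 5.12: the name sobolmat, the list-of-rows matrix layout, and the use of integer shifts in place of B-ARY are all assumptions. It extends the initial values with the XOR recurrence and places the binary digits of each $m_j$ in column j.

```python
def sobolmat(cvec, minit, r):
    """Return an r x r binary generator matrix as a list of rows.

    cvec = (1, c_1, ..., c_{q-1}, 1) holds the polynomial coefficients and
    minit the q initial values m_1..m_q (it is ignored when q = 0).
    """
    q = len(cvec) - 1
    if q == 0:
        return [[1 if i == j else 0 for j in range(r)] for i in range(r)]
    m = list(minit[:q])
    # Extend to m_1..m_r:
    # m_j = 2 c_1 m_{j-1} xor ... xor 2^q m_{j-q} xor m_{j-q}.
    for j in range(q, r):
        nxt = m[j - q] ^ (m[j - q] << q)
        for k in range(1, q):
            if cvec[k]:
                nxt ^= m[j - k] << k
        m.append(nxt)
    V = [[0] * r for _ in range(r)]
    for j in range(r):
        # Column j holds the j+1 binary digits of m_{j+1}, most significant first.
        for row in range(j + 1):
            V[row][j] = (m[j] >> (j - row)) & 1
    return V


# Second and third coordinates of the three-dimensional example in the text.
print(sobolmat((1, 1), (1,), 3))
print(sobolmat((1, 1, 1), (1, 1), 3))
```

With the polynomial x + 1 and $m_1 = 1$ the recurrence gives $m_2 = 3$ and $m_3 = 5$, so the first call reproduces the matrix $V^{(2)}$ of the earlier example.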

The function SOBOLMAT could almost be substituted for FAUREMAT
in the function FAUREPTS of Figure 5.11: the only modification required
is passing the ith primitive polynomial and the ith set of initializing values,
rather than just i itself. The result would be a legitimate algorithm to generate
Sobol’ points.

Rather than reproduce what we did in FAUREPTS, here we display an
implementation using the Gray code construction of Antonov and Saleev [18].
The function SOBOLPTS in Figure 5.13 calls an undefined function
GRAYCODE2 to find a binary Gray code representation. This can be implemented
as

$$\text{GRAYCODE2}(n) = \text{B-ARY}(n, 2) \oplus \text{B-ARY}(\lfloor n/2 \rfloor, 2),$$

after padding the second term on the right-hand side with a leading zero to give
the two arguments the same length.
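In a language with integer bit operations the padding is implicit, and the formula reduces to a single XOR with a right shift. The sketch below (the function names are illustrative, not from the text) prints the first few Gray codes so that the one-bit-at-a-time property can be seen directly.

```python
def gray_code(n):
    """Binary Gray code of n: n XOR floor(n / 2)."""
    return n ^ (n >> 1)


def bits(n, width):
    """Zero-padded binary string, most significant bit first."""
    return format(n, "0{}b".format(width))


for k in range(8):
    # Columns: k, binary representation of k, Gray code of k.
    print(k, bits(k, 3), bits(gray_code(k), 3))
```

Consecutive Gray codes differ in exactly one bit, which is what lets each new point be obtained from the previous one with a single column addition.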

In SOBOLPTS, a Gray code representation is explicitly calculated only
for $n_0 - 1$. The Gray code vector g is subsequently incremented by toggling
the ℓth bit, with ℓ determined by the usual binary representation a, or by
inserting a leading 1 at each power of 2 (in which case ℓ = 1). As in (5.25),
the value of ℓ is then the index of the column of $V^{(i)}$ to be added (mod
2) to the previous point. The coefficients of the binary expansion of the ith


SOBOLMAT(cvec, minit, r)
[cvec has the form (1, c_1, ..., c_{q-1}, 1), minit has length q ≤ r]
q ← length(cvec) − 1
if (q = 0) V ← I [r × r identity]
if (q > 0)
    mvec ← (minit(1, ..., q), 0, ..., 0) [length r]
    mstate ← minit
    for i = q + 1, ..., r
        mnext ← 2c_1 mstate(q) ⊕ 4c_2 mstate(q − 1) ⊕ ··· ⊕ 2^q mstate(1) ⊕ mstate(1)
        mvec(i) ← mnext
        mstate ← (mstate(2, ..., q), mnext)
    for j = 1, ..., r
        mbin ← B-ARY(mvec(j), 2)
        k ← length(mbin)
        for i = 1, ..., k
            V(j − i + 1, j) ← mbin(k − i + 1)
return V


Fig. 5.12. Function SOBOLMAT(cvec, minit, r) returns r × r generator matrix V
constructed from polynomial coefficients (1, c_1, ..., c_{q-1}, 1) in cvec and q initial
values in array minit.


coordinate of each point are held in the ith column of the array y. Taking
the inner product between this column and the powers of 1/2 in bpwrs maps
the coefficients to [0,1). The argument pvec is an array encoding primitive
polynomials using the numerical representation in Table 5.2, and mmat is an
array of initializing values (as in Table 5.3).
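The Gray code recursion can also be sketched compactly in Python. This is a simplified illustration under stated assumptions, not a transcription of Figure 5.13: it always starts from the origin (point 0) rather than from an arbitrary $n_0$, it fixes the number of bits r in advance, and it stores each direction number as an integer scaled by $2^r$ instead of as a matrix column. The names direction_integers and sobol_points are hypothetical.

```python
def direction_integers(cvec, minit, r):
    """Direction numbers v_j = m_j / 2^j for j = 1..r, stored as v_j * 2^r."""
    q = len(cvec) - 1
    m = [1] * r if q == 0 else list(minit[:q])
    for j in range(len(m), r):
        nxt = m[j - q] ^ (m[j - q] << q)
        for k in range(1, q):
            if cvec[k]:
                nxt ^= m[j - k] << k
        m.append(nxt)
    return [m[j] << (r - 1 - j) for j in range(r)]


def sobol_points(npts, polys, mmat, r=30):
    """First npts points in Gray code order, starting from the origin.

    polys[i] is the coefficient vector and mmat[i] the initial values for
    coordinate i; npts must not exceed 2**r.
    """
    V = [direction_integers(c, m, r) for c, m in zip(polys, mmat)]
    x = [0] * len(polys)
    pts = [[0.0] * len(polys)]
    for k in range(1, npts):
        # Column to add: position of the rightmost zero bit of k - 1.
        c = 0
        while ((k - 1) >> c) & 1:
            c += 1
        for i in range(len(x)):
            x[i] ^= V[i][c]
        pts.append([xi / 2 ** r for xi in x])
    return pts


# Three coordinates with the initial values of the earlier example.
polys = [(1,), (1, 1), (1, 1, 1)]
mmat = [(), (1,), (1, 1)]
for p in sobol_points(4, polys, mmat):
    print(p)
```

Each new point costs one XOR per coordinate, which is the efficiency gain the Gray code ordering is meant to deliver.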


5.2.4 Further Constructions 


The discussions in Sections 5.2.2 and 5.2.3 make evident similarities between 
the Faure and Sobol’ constructions: both apply permutations to segments of 
the Van der Corput sequence, and these permutations can be represented 
through generator matrices. This strategy for producing low-discrepancy se- 
quences has been given a very general formulation and analysis by Nieder- 
reiter [281]. Points constructed in this framework are called digital nets or 
sequences. Section 4.5 of Niederreiter [281] presents a special class of digital 
sequences, which encompass the constructions of Faure and Sobol’. Niederre- 
iter shows how to achieve a t parameter through this construction (in base 
2) strictly smaller than the best t parameter for Sobol’ sequences in all di- 
mensions greater than seven. Thus, these Niederreiter sequences have some 
theoretical superiority over Sobol’ sequences. Larcher [219] surveys more re- 
cent theoretical developments in digital point sets. 

Bratley, Fox, and Niederreiter [58] provide a FORTRAN generator for
Niederreiter sequences. They note that for base 2 their program is “essentially


SOBOLPTS(n0, npts, d, pvec, mmat)
nmax ← n0 + npts − 1
rmax ← 1 + ⌊log(nmax)/log(2)⌋, r ← 1
P ← 0 [npts × d], y ← 0 [rmax × d]
if (n0 > 1) r ← 1 + ⌊log(n0 − 1)/log(2)⌋
qnext ← 2^r
a ← B-ARY(n0 − 1, 2)
g ← GRAYCODE2(n0 − 1)
for i = 1, ..., d [build matrices using polynomials in pvec]
    q ← ⌊log(pvec(i))/log(2)⌋
    cvec ← B-ARY(pvec(i), 2)
    V^(i) ← SOBOLMAT(cvec, (mmat(i, 1), ..., mmat(i, q)), rmax)
bpwrs ← (1/2, 1/4, ..., 1/2^rmax)
for i = 1, ..., d [Calculate point n0 − 1 using Gray code]
    for m = 1, ..., r
        for n = 1, ..., r
            y(m, i) ← y(m, i) + V^(i)(m, n) * g(r − n + 1) mod 2
for k = n0, ..., nmax
    if (k = qnext)
        r ← r + 1
        g ← (1, g) [insert 1 in Gray code at powers of 2]
        ℓ ← 1 [first bit changed]
        qnext ← 2 * qnext
    else
        ℓ ← index of rightmost zero in a
        g(ℓ) ← 1 − g(ℓ) [increment Gray code]
    a ← NEXTB-ARY(a, 2)
    for i = 1, ..., d [Calculate point k recursively]
        for m = 1, ..., r
            y(m, i) ← y(m, i) + V^(i)(m, r − ℓ + 1) mod 2
        for j = 1, ..., r
            P(k − n0 + 1, i) ← P(k − n0 + 1, i) + bpwrs(j) * y(j, i)
return P


Fig. 5.13. Function SOBOLPTS(n0, npts, d, pvec, mmat) returns npts × d array whose
rows are coordinates of d-dimensional Sobol’ points, using polynomials encoded in
pvec and initializing values in the rows of mmat.


identical” to one for generating Sobol’ points, differing only in the choice of 
generator matrices. Their numerical experiments indicate roughly the same 
accuracy using Niederreiter and Sobol’ points on a set of test integrals. 
Tezuka [347] introduces a counterpart of the radical inverse function with 
respect to polynomial arithmetic. This naturally leads to a generalization of 
Halton sequences; Tezuka also extends Niederreiter’s digital construction to 
this setting and calls the resulting points generalized Niederreiter sequences.
Tezuka and Tokuyama [349] construct (0, d)-sequences in this setting using
generator matrices for which they give an explicit expression that generalizes
the expression in (5.19) for Faure generator matrices. Tezuka [348] notes that
these generator matrices have the form

$$A^{(i)} (C^{(1)})^{i-1}, \quad i = 1, \ldots, d, \tag{5.27}$$

with $C^{(1)}$ as in (5.19), and $A^{(i)}$ arbitrary nonsingular (mod b) lower triangular
matrices. The method of Tezuka and Tokuyama [349] is equivalent to taking
$A^{(i)}$ to be the transpose of $(C^{(1)})^{i-1}$. Tezuka [348] shows that all sequences
constructed using generator matrices of the form (5.27) in a prime base $b \ge d$
are (0, d)-sequences. He calls these generalized Faure sequences; they are a
special case of his generalized Niederreiter sequences and they include ordinary
Faure sequences (take each $A^{(i)}$ to be the identity matrix). Although the path
leading to (5.27) is quite involved, the construction itself requires only minor
modification of an algorithm to generate Faure points.

Faure [117] proposes an alternative method for choosing generator matrices
to construct (0, d)-sequences and shows that these do not have the form in
(5.27).

A series of theoretical breakthroughs in the construction of low-discrepancy 
sequences have been achieved by Niederreiter and Xing using ideas from alge- 
braic geometry; these are reviewed in their survey article [282]. Their methods 
lead to (t,d)-sequences with theoretically optimal t parameters. Pirsic [298] 
provides a software implementation and some numerical tests. Further numer- 
ical experiments are reported in Hong and Hickernell [187]. 


5.3 Lattice Rules 


The constructions in Sections 5.2.1–5.2.3 are all based on extending the Van
der Corput sequence to multiple dimensions. The lattice rules discussed in this
section provide a different mechanism for constructing low-discrepancy point
sets. Some of the underlying theory of lattice methods suggests that they
are particularly well suited to smooth integrands, but they are applicable to
essentially any integrand.

Lattice methods primarily define fixed-size point sets, rather than infinite
sequences. This is a shortcoming when the number of points required to
achieve a satisfactory precision is not known in advance. We discuss a
mechanism for extending lattice rules after considering the simpler setting of a
fixed number of points.

A rank-1 lattice rule of n points in dimension d is a set of the form

$$\left\{ \frac{k}{n} v \bmod 1, \quad k = 0, 1, \ldots, n-1 \right\}, \tag{5.28}$$


with v a d-vector of integers. Taking the remainder modulo 1 means taking
the fractional part of a number ($x \bmod 1 = x - \lfloor x \rfloor$), and the operation is
applied separately to each coordinate of the vector. To ensure that this set
does indeed contain n distinct points (i.e., that no points are repeated), we
require that n and the components of v have 1 as their greatest common
divisor.

An n-point lattice rule of rank r takes the form

$$\left\{ \sum_{i=1}^{r} \frac{k_i}{n_i} v_i \bmod 1, \quad k_i = 0, 1, \ldots, n_i - 1, \quad i = 1, \ldots, r \right\},$$

for linearly independent integer vectors $v_1, \ldots, v_r$ and integers $n_1, \ldots, n_r \ge 2$
with each $n_i$ dividing $n_{i+1}$, $i = 1, \ldots, r-1$, and $n_1 \cdots n_r = n$. As in the
rank-1 case, we require that $n_i$ and the elements of $v_i$ have 1 as their greatest
common divisor.

Among rank-1 lattices, a particularly simple and convenient class are the
Korobov rules, which have a generating vector v of the form $(1, a, a^2, \ldots, a^{d-1})$,
for some integer a. In this case, (5.28) can be described as follows: for each
$k = 0, 1, \ldots, n-1$, set $y_0 = k$, $u_0 = k/n$,

$$y_i = a y_{i-1} \bmod n, \quad u_i = y_i/n, \quad i = 1, \ldots, d-1,$$

and set $x_k = (u_0, u_1, \ldots, u_{d-1})$. Comparison with Section 2.1.2 reveals that
this is the set of vectors formed by taking d consecutive outputs from a
multiplicative congruential generator, from all initial seeds $y_0$.
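The equivalence between the Korobov recursion and the rank-1 form (5.28) can be checked directly. The sketch below uses exact rational arithmetic; the function names and the values n = 13, a = 5 are arbitrary choices for illustration, not recommended parameters.

```python
from fractions import Fraction


def korobov_points(n, a, d):
    """Korobov rule via the recursion y_i = a * y_{i-1} mod n, u_i = y_i / n."""
    pts = []
    for k in range(n):
        y = k
        point = [Fraction(y, n)]
        for _ in range(1, d):
            y = (a * y) % n
            point.append(Fraction(y, n))
        pts.append(tuple(point))
    return pts


def rank1_points(n, v):
    """Rank-1 lattice rule (5.28): the points (k / n) * v mod 1, k = 0..n-1."""
    return [tuple(Fraction((k * vi) % n, n) for vi in v) for k in range(n)]


n, a, d = 13, 5, 3
same = korobov_points(n, a, d) == rank1_points(n, (1, a, a * a))
print(same, len(set(korobov_points(n, a, d))))
```

Because the first component of v is 1, the n points are automatically distinct, which is the greatest-common-divisor condition stated for (5.28).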

It is curious that the same mechanism used in Chapter 2 to mimic
randomness is here used to try to produce low discrepancy. The apparent paradox is
resolved by noting that here we intend to use the full period of the generator
(we choose the modulus n equal to the number of points to be generated),
whereas the algorithms of Section 2.1 are designed so that we use a small
fraction of the period. To reconcile the two applications, we would like the
discrepancy of the first N out of n points to be $O(1/\sqrt{N})$ for small N and
$O((\log N)^d/N)$ for large N.

The connection between Korobov rules and multiplicative congruential 
generators has useful consequences. It simplifies implementation and it fa- 
cilitates the selection of generating vectors by making relevant the exten- 
sively studied properties of random number generators; see Hellekalek [176], 
Hickernell, Hong, L’Ecuyer, and Lemieux [184], L’Ecuyer and Lemieux [226], 
Lemieux and L’Ecuyer [232], and Niederreiter [281]. Hickernell [182], Chapter 
5 of Niederreiter [281], and Sloan and Joe [333] analyze the discrepancy of 
lattice rules and other measures of their quality. 

Tables of good generating vectors v can be found in Fang and Wang [115] 
for up to 18 dimensions. Sloan and Joe [333] give tables for higher-rank lattice 
rules in up to 12 dimensions. L’Ecuyer and Lemieux [226] provide tables of 
multipliers a for Korobov rules passing tests of uniformity.


Integration Error


The Koksma-Hlawka bound (5.10) applies to lattice rules as it does to all 
point sets. But the special structure of lattice rules leads to a more explicit 
expression for the integration error using such a rule, and this in turn sheds 
light on both the design and scope of these methods. 

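Fix an integrand f on $[0,1)^d$, and for each d-vector of integers z define the
Fourier coefficient

$$\hat{f}(z) = \int_{[0,1)^d} f(x)\, e^{-2\pi\sqrt{-1}\, x^\top z}\, dx.$$

The integral of f over the hypercube is $\hat{f}(0)$. Suppose that f is sufficiently
regular to be represented by its Fourier series, in the sense that

$$f(x) = \sum_{z} \hat{f}(z)\, e^{2\pi\sqrt{-1}\, x^\top z}, \tag{5.29}$$

the sum ranging over all integer vectors z and converging absolutely.

A rank-1 lattice rule approximation to the integral of f is

$$\frac{1}{n} \sum_{k=0}^{n-1} f\!\left(\frac{k}{n} v \bmod 1\right)
= \frac{1}{n} \sum_{k=0}^{n-1} \sum_{z} \hat{f}(z) \exp\!\left(2\pi\sqrt{-1}\, k v^\top z/n\right)
= \sum_{z} \hat{f}(z)\, \frac{1}{n} \sum_{k=0}^{n-1} \left[\exp\!\left(2\pi\sqrt{-1}\, v^\top z/n\right)\right]^k. \tag{5.30}$$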
RE 


The first equality follows from the Fourier representation of f and the
periodicity of the function $u \mapsto \exp(2\pi\sqrt{-1}\, u)$, which allows us to omit the
reduction modulo 1. For the second equality, the interchange in the order of
summation is justified by the assumed absolute convergence of the Fourier series
for f. Now the average over k in (5.30) simplifies to

$$\frac{1}{n} \sum_{k=0}^{n-1} \left[\exp\!\left(2\pi\sqrt{-1}\, v^\top z/n\right)\right]^k =
\begin{cases} 1, & \text{if } v^\top z = 0 \bmod n, \\ 0, & \text{otherwise.} \end{cases}$$

To see why, observe that if $v^\top z/n$ is an integer then each of the summands
on the left is just 1; otherwise,

$$\sum_{k=0}^{n-1} \left[\exp\!\left(2\pi\sqrt{-1}\, v^\top z/n\right)\right]^k
= \frac{1 - \exp\!\left(2\pi\sqrt{-1}\, v^\top z\right)}{1 - \exp\!\left(2\pi\sqrt{-1}\, v^\top z/n\right)} = 0$$

because $v^\top z$ is an integer. Using this in (5.30), we find that the lattice rule
approximation simplifies to the sum of $\hat{f}(z)$ over all integer vectors z for which
$v^\top z = 0 \bmod n$. The correct value of the integral is $\hat{f}(0)$, so the error in the
approximation is

$$\sum_{z \ne 0,\ v^\top z = 0 \bmod n} \hat{f}(z). \tag{5.31}$$
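The error formula can be seen at work on an integrand with a single Fourier frequency. The sketch below is an illustration with arbitrarily chosen values (n = 5, v = (1, 2)); the function names are hypothetical. For $f(x) = 1 + \cos(2\pi z^\top x)$ the only nonzero coefficients are $\hat{f}(0) = 1$ and $\hat{f}(\pm z) = 1/2$, so (5.31) predicts an error of 1 when $v^\top z = 0 \bmod n$ and an error of 0 otherwise.

```python
import math


def lattice_estimate(f, n, v):
    """Average of f over the rank-1 lattice points (k / n) * v mod 1."""
    total = 0.0
    for k in range(n):
        x = [((k * vi) % n) / n for vi in v]
        total += f(x)
    return total / n


def cosine_integrand(z):
    """f(x) = 1 + cos(2 pi z.x); its integral over the unit cube is 1."""
    return lambda x: 1.0 + math.cos(2.0 * math.pi * sum(zi * xi for zi, xi in zip(z, x)))


n, v = 5, (1, 2)
err_dual = lattice_estimate(cosine_integrand((1, 2)), n, v) - 1.0   # v.z = 5 = 0 mod 5
err_other = lattice_estimate(cosine_integrand((1, 1)), n, v) - 1.0  # v.z = 3
print(err_dual, err_other)
```

The first frequency lies on the dual lattice and is integrated as badly as possible; the second is integrated exactly, up to rounding.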


The values of $|\hat{f}(z)| = |\hat{f}(z_1, \ldots, z_d)|$ for large values of
$\|z\| = |z_1| + \cdots + |z_d|$ reflect the smoothness of f, in the sense that large
values of $\|z\|$ correspond to high-frequency oscillation terms in the Fourier
representation of f. The expression in (5.31) for the integration error thus
suggests the following:


(i) the generator v should be chosen so that $v^\top z = 0 \bmod n$ only if $\|z\|$ is
large (vectors v with this property are known informally as good lattice
points);

(ii) lattice rules are particularly well suited to integrands f that are smooth
precisely in the sense that $\hat{f}(z)$ decreases quickly as $\|z\|$ increases.


The first of these observations helps guide the search for effective choices 
of v. (Results showing the existence of good lattice points are detailed in 
Chapter 5 of Niederreiter [281].) Precise criteria for selecting v are related to 
the spectral test mentioned in the discussion of random number generators in 
Section 2.1. Recommended values are tabulated in Fang and Wang [115]. 

The direct applicability of observation (ii) seems limited, at least for the
integrands implicit in derivative pricing. Bounding the Fourier coefficients of
such functions is difficult, and there is little reason to expect these functions to
be smooth. Moreover, the derivation leading to (5.31) obscures an important
restriction: for the Fourier series to converge absolutely, f must be continuous
on $[0,1]^d$ and periodic at the boundaries (because absolute convergence makes
f the uniform limit of functions with these properties). Sloan and Joe [333]
advise caution in applying lattice rules to nonperiodic integrands; in their
numerical results, they find Monte Carlo to be the best method for discontinuous
integrands.


Extensible Lattice Rules 


We conclude our discussion of lattice rules with a method of Hickernell et al. 
[184] for extending fixed-size lattice rules to infinite sequences. 

Consider a rank-1 lattice rule with generating vector $v = (v_1, \ldots, v_d)$.
Suppose the number of points n equals $b^r$ for some base b and integer r. Then
the segment $\psi_b(0), \psi_b(1), \ldots, \psi_b(n-1)$ of the Van der Corput sequence in
base b is a permutation of the coefficients k/n, $k = 0, 1, \ldots, n-1$, appearing
in (5.28). The point set is therefore unchanged if we represent it as

$$\{\psi_b(k)\, v \bmod 1, \quad k = 0, 1, \ldots, n-1\}.$$


We may now drop the upper limit on k to produce an infinite sequence.

The first $b^r$ points in this sequence are the original lattice point set. Each
of the next (b − 1) nonoverlapping segments of length $b^r$ will be shifted versions
of the original lattice. The first $b^{r+1}$ points will again form a lattice rule of
the type (5.28), but now with n replaced by $b^{r+1}$, and so on. In this way, the
construction extends and refines the original lattice.
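The extension is short enough to sketch in full. The code below is an illustration using exact rational arithmetic; the function names and the choices b = 2 and v = (1, 5) are assumptions made for the example, not recommended parameters. It can be used to confirm that the first $b^r$ points reproduce the original lattice and that the next segment is a shifted copy.

```python
from fractions import Fraction


def radical_inverse(k, b):
    """Van der Corput radical inverse of k in base b, as an exact fraction."""
    value, scale = Fraction(0), Fraction(1, b)
    while k > 0:
        k, digit = divmod(k, b)
        value += digit * scale
        scale /= b
    return value


def extensible_lattice(npts, v, b):
    """First npts points of the sequence psi_b(k) * v mod 1, k = 0, 1, ..."""
    return [tuple((radical_inverse(k, b) * vi) % 1 for vi in v) for k in range(npts)]


v, b = (1, 5), 2
pts = extensible_lattice(16, v, b)
lattice8 = {(Fraction(k, 8), Fraction((5 * k) % 8, 8)) for k in range(8)}
print(set(pts[:8]) == lattice8)
```

Stopping at any power of b therefore yields a complete lattice rule, so the sample size can be doubled (for b = 2) without discarding earlier points.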

Hickernell et al. [184] give particular attention to extensible Korobov rules, 
which are determined by the single parameter a. They provide a table of values 
of this parameter that exhibit good uniformity properties when extended using 
b = 2. Their numerical results use a = 17797 and a = 1267. 


5.4 Randomized QMC 


We began this chapter with the suggestion that choosing points determinis- 
tically rather than randomly can reduce integration error. It may therefore 
seem odd to consider randomizing points chosen carefully for this purpose. 
There are, however, at least two good reasons for randomizing QMC. 

The first reason is as immediately applicable as it is evident: by random- 
izing QMC points we open the possibility of measuring error through a confi- 
dence interval while preserving much of the accuracy of pure QMC. Random- 
ized QMC thus seeks to combine the best features of ordinary Monte Carlo 
and quasi-Monte Carlo. The tradeoff it poses — sacrificing some precision to 
get a better measure of error — is essentially the same one we faced with 
several of the variance reduction techniques of Chapter 4. 

The second reason to consider randomizing QMC is less evident and may
also be less practically relevant: there are settings in which randomization
actually improves accuracy. A particularly remarkable result of this type is a
theorem of Owen [289] showing that the root mean square error of integration
using a class of randomized nets is $O(1/n^{1.5-\epsilon})$, whereas the error without
randomization is $O(1/n^{1-\epsilon})$. Owen’s result applies to smooth integrands and
may therefore be of limited applicability to pricing derivatives; nevertheless,
it is notable that randomization takes advantage of the additional
smoothness though QMC does not appear to. Hickernell [180], Matoušek [257], and
L’Ecuyer and Lemieux [226] discuss other ways in which randomization can
improve accuracy.

We describe four methods for randomizing QMC. For a more extensive
treatment of the topic, see the survey article of L’Ecuyer and Lemieux [226]
and Chapter 14 of Fox [127]. We limit the discussion to randomization of point
sets of a fixed size n. We denote such a point set generically by

$$P_n = \{x_1, \ldots, x_n\},$$

each $x_i$ an element of $[0,1)^d$.


Random Shift 


The simplest randomization of the point set $P_n$ generates a random vector U
uniformly distributed over the d-dimensional unit hypercube and shifts each
point in $P_n$ by U, modulo 1:

$$P_n(U) = \{x_i + U \bmod 1, \quad i = 1, \ldots, n\}. \tag{5.32}$$

The reduction mod 1 applies separately to each coordinate. The randomized
QMC estimate of the integral of f is

$$I_f(U) = \frac{1}{n} \sum_{i=1}^{n} f(x_i + U \bmod 1).$$


This mechanism was proposed by Cranley and Patterson [92] in the setting 
of a lattice rule, but can be applied with other low-discrepancy point sets. It 
should be noted, however, that the transformation changes the discrepancy of 
a point set and that a shifted (t, m, d)-net need not be a (t, m, d)-net. 

For any $P_n \subset [0,1)^d$, each element of $P_n(U)$ is uniformly distributed
over the hypercube, though the points are clearly not independent. Repeating
the randomization with independent replications of U produces independent
batches of n points each. Each batch yields a QMC estimate of the form
$I_f(U)$, and these estimates are independent and identically distributed.
Moreover, each $I_f(U)$ is an unbiased estimate of the integral of f, so computing an
asymptotically valid confidence interval for the integral is straightforward.
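A minimal sketch of the batching procedure follows. Everything specific in it is an assumption made for illustration: the Korobov parameters n = 1021 and a = 76, the test integrand $f(x) = x_1 x_2$ (whose integral over the unit square is 1/4), the number of batches, and the normal quantile 1.96 for a nominal 95% interval.

```python
import math
import random
import statistics


def shifted_estimate(f, points, rng):
    """One randomized QMC estimate: shift every point by the same uniform U, mod 1."""
    d = len(points[0])
    u = [rng.random() for _ in range(d)]
    total = sum(f([(x + ui) % 1.0 for x, ui in zip(p, u)]) for p in points)
    return total / len(points)


def rqmc_confidence_interval(f, points, batches, rng, z=1.96):
    """Mean and half-width from independent random shifts of the same point set."""
    estimates = [shifted_estimate(f, points, rng) for _ in range(batches)]
    mean = statistics.mean(estimates)
    half = z * statistics.stdev(estimates) / math.sqrt(batches)
    return mean, half


# Two-dimensional Korobov points; n and a are illustrative values only.
n, a = 1021, 76
points = [(k / n, ((a * k) % n) / n) for k in range(n)]
f = lambda x: x[0] * x[1]
mean, half = rqmc_confidence_interval(f, points, 20, random.Random(12345))
print(mean, half)
```

The sample standard deviation is taken across batches, not across the n points within a batch, because only the batch estimates are independent.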

L’Ecuyer and Lemieux [226] compare the variance of $I_f(U)$ and an ordinary
Monte Carlo estimate with $P_n$ a lattice rule. They show that either variance
could be smaller, depending on the integrand f, but argue that $I_f(U)$ often
has smaller variance in problems of practical interest.

The random-shift procedure may be viewed as an extreme form of
systematic sampling (discussed in Section 4.2), in which a single point U is chosen
randomly and n points are then chosen deterministically conditional on U. The
variance calculation of L’Ecuyer and Lemieux [226] for a randomly shifted
lattice rule has features in common with the calculation for systematic sampling
in Section 4.2.


Random Permutation of Digits 


Another mechanism for randomizing QMC applies a random permutation
of 0, 1, ..., b − 1 to the coefficients in the base-b expansion of the
coordinates of each point. Consider, first, the one-dimensional case and write
$x_k = 0.a_1(k)a_2(k)\ldots$ for a b-ary representation of $x_k$. Let $\pi_j$, $j = 1, 2, \ldots$
be independent random permutations of {0, 1, ..., b − 1}, uniformly
distributed over all b! permutations of the set. Randomize $P_n$ by mapping each point
$x_k$ to the point $0.\pi_1(a_1(k))\pi_2(a_2(k))\ldots$, applying the same permutations $\pi_j$
to all points $x_k$. For a d-dimensional point set, randomize each coordinate in
this way, using independent permutations for different coordinates.

Randomizing an arbitrary point $x \in [0,1)^d$ in this way produces a random
vector uniformly distributed over $[0,1)^d$. Thus, for any $P_n$, the average of
f over the randomization of $P_n$ is an unbiased estimate of the integral of
f. Independent randomizations produce independent estimates that can be
combined to estimate a confidence interval.

Matoušek [257] analyzes the expected mean-square discrepancy for a
general class of randomization procedures that includes this one. This
randomization maps b-ary boxes to b-ary boxes of the same volume, so if $P_n$ is a
(t, m, d)-net, its randomization is too.
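A one-dimensional sketch of the digit permutation is given below. It rests on a simplifying assumption that is not part of the method as described: each point is truncated to m base-b digits and represented by its integer numerator over $b^m$, so only the first m permutations are ever needed. The function names are illustrative.

```python
import random


def digits(x_num, b, m):
    """Base-b digits a_1..a_m of x = x_num / b^m, most significant first."""
    out = []
    for _ in range(m):
        x_num, d = divmod(x_num, b)
        out.append(d)
    return out[::-1]


def permute_digits(point_nums, b, m, rng):
    """Apply one random permutation per digit position, the same for every point."""
    perms = []
    for _ in range(m):
        p = list(range(b))
        rng.shuffle(p)
        perms.append(p)
    result = []
    for x_num in point_nums:
        a = digits(x_num, b, m)
        new = [perms[j][a[j]] for j in range(m)]
        result.append(sum(dj * b ** (m - 1 - j) for j, dj in enumerate(new)))
    return result


# Nine points k / 9 in base 3 with m = 2 digits, given by their numerators k.
shuffled = permute_digits(list(range(9)), 3, 2, random.Random(1))
print(shuffled)
```

For a d-dimensional point set the function would be called once per coordinate with independent random draws, as the text prescribes.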


Scrambled Nets 


Owen [287, 288, 289] introduces and analyzes a randomization mechanism
that uses a hierarchy of permutations. This scrambling procedure permutes
each digit of a b-ary expansion, but the permutation applied to the jth digit
depends on the first j − 1 digits.

To make this more explicit, first consider the one-dimensional case.
Suppose x has b-ary representation $0.a_1a_2a_3\ldots$. The first coefficient $a_1$ is mapped
to $\pi(a_1)$, with $\pi$ a random permutation of {0, 1, ..., b − 1}. The second
coefficient is mapped to $\pi_{a_1}(a_2)$, the third coefficient to $\pi_{a_1a_2}(a_3)$, and so on;
the random permutations $\pi$, $\pi_{a_1}$, $\pi_{a_1a_2}$, ..., $a_j = 0, 1, \ldots, b-1$, $j = 1, 2, \ldots$,
are independent with each uniformly distributed over the set of all
permutations of {0, 1, ..., b − 1}. To scramble a d-dimensional point set, apply this
procedure to each coordinate, using independent sets of permutations for each
coordinate.

Owen [290] describes scrambling as follows. In each coordinate, partition
the unit interval into b subintervals of length 1/b and randomly permute those
subintervals. Further partition each subinterval into b subintervals of length
$1/b^2$ and permute those, randomly and independently, and so on. At the jth
step, this procedure constructs $b^{j-1}$ partitions, each consisting of b intervals,
and permutes each partition independently. In contrast, Matoušek’s random
digit permutation applies the same permutation to all $b^{j-1}$ partitions at each
step j.
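The nesting can be made concrete with a dictionary of permutations keyed by the leading digits. The sketch below is a one-dimensional illustration under the same simplifying assumption of truncation to m base-b digits (points are passed as integer numerators over $b^m$); the function name scramble is hypothetical, and permutations are generated lazily the first time a prefix is encountered.

```python
import random


def scramble(point_nums, b, m, rng):
    """Nested scrambling of the first m base-b digits of each point."""
    perms = {}  # one permutation per digit prefix (a_1, ..., a_{j-1})
    result = []
    for x_num in point_nums:
        a = []
        for _ in range(m):
            x_num, d = divmod(x_num, b)
            a.append(d)
        a.reverse()  # a[0] is now the leading digit a_1
        new = 0
        for j in range(m):
            prefix = tuple(a[:j])
            if prefix not in perms:
                p = list(range(b))
                rng.shuffle(p)
                perms[prefix] = p
            new = new * b + perms[prefix][a[j]]
        result.append(new)
    return result


# Sixteen points k / 16 in base 2, given by their numerators k.
scrambled = scramble(list(range(16)), 2, 4, random.Random(7))
print(scrambled)
```

Storing a permutation for every prefix is what makes full scrambling memory-hungry: the dictionary can hold up to $1 + b + \cdots + b^{m-1}$ entries per coordinate.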

Owen [287] shows that a scrambled (t, m, d)-net is a (t, m, d)-net with
probability one, and a scrambled (t, d)-sequence is a (t, d)-sequence with
probability one. Owen [288, 289, 290] shows that the variance of a scrambled net
estimator converges to zero faster than the variance of an ordinary Monte Carlo
estimator does, while cautioning that the faster rate may not set in until the
number of points becomes very large. For sufficiently smooth integrands, the
variance is $O(1/n^{3-\epsilon})$ in the sample size n. The superior asymptotic
performance with randomization results from cancellation of error terms. For fixed
sample sizes, Owen [288, 290] bounds the amount by which the scrambled
net variance can exceed the Monte Carlo variance. Hickernell and Hong [183]
analyze the mean square discrepancy of scrambled nets.

Realizing the attractive features of scrambled nets in practice is not
entirely straightforward because of the large number of permutations required
for scrambling. Tan and Boyle [345] propose an approximate scrambling
method based on permuting just the first few digits and find experimentally
that it works well. Matoušek [257] outlines an implementation of full
scrambling that reduces memory requirements at the expense of increasing
computing time: rather than store a permutation, he stores the state of the random
number generator and regenerates each permutation when it is needed. Hong
and Hickernell [187] define a simplified form of scrambling and provide
algorithms that generate scrambled points in about twice the time required for
unscrambled points.


Linear Permutation of Digits 


As an alternative to full scrambling, Matoušek [257] proposes a “linear”
permutation method. This method maps a base-b expansion $0.a_1a_2\ldots$ to
$0.\tilde{a}_1\tilde{a}_2\ldots$ using

$$\tilde{a}_j = \sum_{i=1}^{j} h_{ij} a_i + g_j \bmod b,$$

with the $h_{ij}$ and $g_j$ chosen randomly and independently from {0, 1, ..., b − 1}
and the $h_{ii}$ required to be positive. This method is clearly easier to implement
than full scrambling. Indeed, if the $g_j$ were all 0, this would reduce to the
generalized Faure method in (5.27) when applied to a Faure net $P_n$. The
condition that the diagonal entries $h_{ii}$ be positive ensures the nonsingularity
required in (5.27).
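The linear method needs only a triangular array of coefficients and a shift vector. The sketch below is a one-dimensional illustration under two assumptions that go beyond the description above: points are truncated to m base-b digits (and passed as integer numerators over $b^m$), and b is taken to be prime so that a positive diagonal entry is invertible mod b. The function name is hypothetical.

```python
import random


def linear_scramble(point_nums, b, m, rng):
    """Map digits a_1..a_m to (sum_{i<=j} h_ij a_i + g_j) mod b, for prime b."""
    # h[j][i] multiplies a_{i+1} in the new digit j+1; diagonal entries are nonzero.
    h = [[rng.randrange(1, b) if i == j else rng.randrange(b) for i in range(j + 1)]
         for j in range(m)]
    g = [rng.randrange(b) for _ in range(m)]
    result = []
    for x_num in point_nums:
        a = []
        for _ in range(m):
            x_num, d = divmod(x_num, b)
            a.append(d)
        a.reverse()  # a[0] is the leading digit a_1
        new = 0
        for j in range(m):
            digit = (sum(h[j][i] * a[i] for i in range(j + 1)) + g[j]) % b
            new = new * b + digit
        result.append(new)
    return result


# Twenty-seven points k / 27 in base 3, given by their numerators k.
mixed = linear_scramble(list(range(27)), 3, 3, random.Random(3))
print(mixed)
```

Only m(m + 1)/2 + m random integers are stored per coordinate, compared with a permutation for every digit prefix under full scrambling.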


All of the randomization methods described in this section produce points
uniformly distributed over $[0,1)^d$ and thus unbiased estimators of integrals
over $[0,1)^d$ when applied in the QMC approximation (5.2). Through
independent replications of any of these it is a simple matter to construct
asymptotically valid confidence intervals. The methods vary in evident ways in their
computational requirements; the relative merits of the estimates they produce
are less evident and warrant further investigation.


5.5 The Finance Setting 


Our discussion of quasi-Monte Carlo has thus far been fairly abstract, dealing 
with the generic problem of numerical integration over $[0,1)^d$. In this section, 
we deal more specifically with the application of QMC to the pricing of 
derivative securities. Section 5.5.1 discusses numerical results comparing QMC 
methods and ordinary Monte Carlo on some test problems. Section 5.5.2 dis- 
cusses ways of taking advantage of the structure of financial models to enhance 
the effectiveness of QMC methods. 


5.5.1 Numerical Examples 


Several articles have reported numerical results obtained by applying QMC 
methods to financial problems. These include Acworth et al. [4], Berman [45], 
Boyle et al. [53], Birge [47], Caflisch, Morokoff, and Owen [73], Joy, Boyle, and 
Tan [204], Ninomiya and Tezuka [283], Papageorgiou and Traub [293], Paskov 
[295], Paskov and Traub [296], Ross [309], and Tan and Boyle [345]. These in- 
vestigations consider several different QMC methods applied to various pricing 
problems and find that they work well. We comment more generally on the 
numerical evidence after considering some examples. 

A convenient set of problems for testing QMC methods are options on 
geometric averages of lognormally distributed asset prices. These options are 
tractable in arbitrarily high dimensions (and knowing the correct answer is 
useful in judging performance of numerical methods) while sharing features 
of more challenging multiple-asset and path-dependent pricing problems. We 
consider, then, options with payoffs $(\bar{S} - K)^+$ where either 

$$\bar{S} = \left( \prod_{i=1}^{d} S_i(T) \right)^{1/d} \qquad (5.33)$$

for multiple assets $S_1,\ldots,S_d$, or 

$$\bar{S} = \left( \prod_{i=1}^{d} S(iT/d) \right)^{1/d} \qquad (5.34)$$

for a single asset $S$. The underlying assets $S_1,\ldots,S_d$ or $S$ are modeled as 
geometric Brownian motion. Because $\bar{S}$ is lognormally distributed in both 
cases, the option price is given by a minor modification of the Black-Scholes 
formula, as noted in Section 3.2.2. 

The two cases (5.33) and (5.34) reflect two potential sources of high dimen- 
sionality in financial problems: d is the number of underlying assets in (5.33) 
and it is the number of time steps in (5.34). Of course, in both cases $\bar{S}$ is the 
geometric average of jointly lognormal random variables, so this distinction 
is purely a matter of interpretation. The real distinction is the correlation 
structure among the averaged random variables. In (5.34), the correlation is 
determined by the dynamics of geometric Brownian motion and is rather high; 
in (5.33), we are free to choose any correlation matrix for the (logarithms of 
the) d assets. Choosing a high degree of correlation would be similar to reduc- 
ing the dimension of the problem; to contrast with (5.34), in (5.33) we choose 
the d assets to be independent of each other. 
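
For reference, the closed-form prices for these two test problems can be sketched as follows. The sketch assumes independent, identically distributed assets GBM(r, σ²) with a common initial value in (5.33) and equally spaced dates iT/d in (5.34); the function names are illustrative. In both cases log S̄ is normal, so the price is a lognormal call formula with the appropriate mean and variance.

```python
import math
from statistics import NormalDist

def lognormal_call(mean_log, var_log, strike, discount):
    """Discounted E[(X - K)^+] when log X is normal(mean_log, var_log)."""
    sd = math.sqrt(var_log)
    d2 = (mean_log - math.log(strike)) / sd
    d1 = d2 + sd
    cdf = NormalDist().cdf
    forward = math.exp(mean_log + 0.5 * var_log)   # E[X]
    return discount * (forward * cdf(d1) - strike * cdf(d2))

def multi_asset_geometric_call(s0, strike, r, sigma, T, d):
    """(5.33): d independent copies of GBM(r, sigma^2), each starting at s0."""
    mean_log = math.log(s0) + (r - 0.5 * sigma ** 2) * T
    var_log = sigma ** 2 * T / d
    return lognormal_call(mean_log, var_log, strike, math.exp(-r * T))

def time_average_geometric_call(s0, strike, r, sigma, T, d):
    """(5.34): d equally spaced dates iT/d on one GBM(r, sigma^2) path."""
    mean_log = math.log(s0) + (r - 0.5 * sigma ** 2) * T * (d + 1) / (2 * d)
    # Var of the average of log S(iT/d): sum_{i,j} min(i,j) = d(d+1)(2d+1)/6
    var_log = sigma ** 2 * T * (d + 1) * (2 * d + 1) / (6 * d ** 2)
    return lognormal_call(mean_log, var_log, strike, math.exp(-r * T))

print(multi_asset_geometric_call(100.0, 100.0, 0.05, 0.45, 0.25, 5))
print(time_average_geometric_call(100.0, 100.0, 0.05, 0.45, 0.25, 5))
```

With d = 1 both functions reduce to the ordinary Black-Scholes call price, which gives a quick consistency check.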

A comparison of methods requires a figure of merit. For Monte Carlo meth- 
ods, variance is an appropriate figure of merit — at least for unbiased esti- 
mators with similar computing requirements, as argued in Section 1.1.3. The 
average of n independent replications has a variance exactly n times smaller 
than the variance of a single replication, so a comparison of variances is not 
tied to a particular sample size. In contrast, the integration error produced by 
a QMC method does depend on the number of points n, and often quite errat- 
ically. Moreover, the QMC error can be quite sensitive to problem parameters. 
This makes the comparison of QMC methods less straightforward. 

As our figure of merit, we take the root mean square error or root mean 
square relative error over a fixed set of problem instances. This is somewhat 
arbitrary (especially in the choice of instances) but nevertheless informative. 
Given $m$ problems with true values $C_1,\ldots,C_m$ and $n$-point QMC approxi- 
mations $\hat{C}_1(n),\ldots,\hat{C}_m(n)$, the root mean square error is 

$$\mathrm{RMSE}(n) = \sqrt{\frac{1}{m} \sum_{i=1}^{m} \left( \hat{C}_i(n) - C_i \right)^2}$$

and the RMS relative error is 

$$\sqrt{\frac{1}{m} \sum_{i=1}^{m} \left( \frac{\hat{C}_i(n) - C_i}{C_i} \right)^2}.$$

In order to compare QMC methods with Monte Carlo, we extend these 
definitions to random estimators $\hat{C}_i(n)$ by replacing $(\hat{C}_i(n) - C_i)^2$ with 
$E[(\hat{C}_i(n) - C_i)^2]$ in both cases. 
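
The two figures of merit translate directly into code. The following minimal sketch takes one estimate per problem instance; the function names are illustrative.

```python
import math

def rmse(estimates, true_values):
    """Root mean square error over m problem instances."""
    m = len(true_values)
    return math.sqrt(sum((e - c) ** 2 for e, c in zip(estimates, true_values)) / m)

def rms_relative_error(estimates, true_values):
    """Root mean square of the relative errors (estimate - true) / true."""
    m = len(true_values)
    return math.sqrt(sum(((e - c) / c) ** 2 for e, c in zip(estimates, true_values)) / m)

# Two hypothetical instances with true values 1 and 4
print(rmse([1.0, 2.0], [1.0, 4.0]), rms_relative_error([1.0, 2.0], [1.0, 4.0]))
```

For a random estimator, each squared error would be replaced by an estimate of its expectation, for example an average over independent replications.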

Our first example is based on (5.33) with d = 5 assets; as this is a rela- 
tively low-dimensional problem, it should be particularly well suited to QMC 
methods. For simplicity, we take the five assets to be independent copies of the 
same process GBM($r$, $\sigma^2$) with an initial value of $S_i(0) = 100$. We fix $r$ at 5%, 
and construct 500 problem instances through all combinations of the following 
parameters: the maturity $T$ is 0.15, 0.25, 0.5, 1, or 2 years; the volatility $\sigma$ 
varies from 0.21 to 0.66 in increments of 0.05; and the strike K varies from 
94 to 103 in increments of 1. These 500 options range in price from 0.54 to 
12.57; their average value is 5.62, and half lie between 4.06 and 7.02. 

Figure 5.14 plots the RMSE against the number of points, using a log scale 
for both axes. For the QMC methods, the figure displays the exact number of 
points used. For the Sobol’ points, we skipped the first 256 points and then 
chose the number of points to be powers of two. For the Faure points, we 
skipped the first 625 (= $5^4$) points and then chose the number of points to be 
powers of five (the base). These choices are favorable for each method. For the 
lattice rules the number of points is fixed. We used the following generating 
vectors from p.287 of Fang and Wang [115]: 

     n  |  generating vector v
  1069  |  1, 63, 762, 970, 177
  4001  |  1, 1534, 568, 3095, 2544
 15019  |  1, 10641, 2640, 6710, 784
 71053  |  1, 33755, 65170, 12740, 6878


These we implemented using the shifted points $k(v - 0.5)/n$ (mod 1) as suggested 
in Fang and Wang [115], rather than (5.28). We also tested Korobov 
rules from L’Ecuyer and Lemieux [226] generated by a = 331, 219, 1716, 7151, 
665, and 5693; these gave rather poor and erratic results and are therefore 
omitted from the figure. Using Monte Carlo, the RMSE scales exactly with 
$1/\sqrt{n}$, so we estimated it at n = 64000 and then extended this value to other 
values of n. 
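
As an illustration, the lattice points used here can be generated as follows. The sketch assumes the index k runs over 0, 1, ..., n-1 and that the unshifted rule is kv/n (mod 1); the function name and the `shifted` flag are illustrative.

```python
def lattice_points(n, v, shifted=True):
    """Rank-1 lattice points with generating vector v.

    shifted=True  uses k (v_j - 0.5) / n  (mod 1), the shifted form above;
    shifted=False uses k v_j / n          (mod 1).
    The index range k = 0, ..., n-1 is an assumption of this sketch.
    """
    offset = 0.5 if shifted else 0.0
    return [[(k * (vj - offset) / n) % 1.0 for vj in v] for k in range(n)]

# Smallest rule from the table of generating vectors
V_1069 = (1, 63, 762, 970, 177)
points = lattice_points(1069, V_1069)
print(len(points), points[1])
```

Each point costs only d multiplications and reductions modulo 1, but a rule with a different n requires a different generating vector.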

Figure 5.14 suggests several observations. The QMC methods produce root 
mean square errors three to ten times smaller than those of Monte Carlo over 
the range of sample sizes considered. Faure points appear to outperform the 
lattice rules and Sobol’ points outperform the Faure points. In addition to 
producing smaller errors, the QMC methods appear to converge at a faster 
rate than Monte Carlo: their graphs are not only lower, they have a steeper 
slope. For Sobol’ and Faure points, a convergence rate close to O(1/n) (evi- 
denced by a slope close to —1) sets in after a few thousand points. The slope 
for Monte Carlo is exactly —1/2 by construction. 


[Figure: RMSE against Number of Points on log-log axes; legible curve labels are "Monte Carlo" and "71053 Lattice".]


Fig. 5.14. Root mean square errors in pricing 500 options on the geometric mean 
of five underlying assets. 


The relative smoothness of the convergence of the Faure and Sobol’ ap- 
proximations relies critically on our choice of favorable values of n for each 
method. For example, taking n = 9000 produces larger RMS errors than 
n = 3125 for the Faure sequence and larger than n = 4096 for the Sobol’ 
sequence. The various points plotted for lattice rules are unrelated to each 
other because each value of n uses a different generating vector, whereas the 
Sobol’ and Faure results use initial segments of infinite sequences. 

Figures 5.15 and 5.16 examine the effect of increasing problem dimension 
while keeping the number of points nearly fixed. Figure 5.15 is based on (5.33) 
with d = 10, 30, 40, 70, 100, and 150. For this comparison we held the maturity 
T fixed at 0.25, let the strike K range from 94 to 102 in increments of 2, and 
let $\sigma$ vary as before. Thus, we have fifty options for each value of d. 

Because increasing d in the geometric mean (5.33) has the effect of reducing 
the volatility of $\bar{S}$, the average option price decreases with d, dropping from 
3.46 at d = 10 to 1.85 at d = 150. Root mean square errors also decline, so 
to make the comparison more meaningful we look at relative errors. These 
increase with d for all three methods considered in Figure 5.15. The Monte 
Carlo results are estimated RMS relative errors for a sample size of 5000, 
but estimated from 64,000 replications. The Sobol’ sequence results in all 
dimensions skip 4096 points and use n = 5120; this is $2^{12} + 2^{10}$ and should be 
favorable for a base-2 construction. For the Faure sequence, the base changes 
with dimension. For each d we chose a value of n near 5000 that should be 
favorable for the corresponding base: these values are $4 \cdot 11^3$, $5 \cdot 31^2 + 7 \cdot 31$, 
$3 \cdot 41^2$, $71^2$, $50 \cdot 101$, and $34 \cdot 151$. In each case, we skipped the first $b^4$ points, 
with b the base. The figure suggests that the advantage of the QMC methods 
relative to Monte Carlo declines with increasing dimension but is still evident 
at d = 150. 
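
The arithmetic behind these sample sizes is easy to check. The sketch below assumes the Faure base is the smallest prime greater than or equal to the dimension d, and the helper name is illustrative.

```python
def smallest_prime_at_least(d):
    """Smallest prime b >= d, assumed here to be the Faure base in dimension d."""
    def is_prime(k):
        if k < 2:
            return False
        i = 2
        while i * i <= k:
            if k % i == 0:
                return False
            i += 1
        return True
    b = d
    while not is_prime(b):
        b += 1
    return b

# Number of Faure points used in each dimension d, as listed above
faure_n = {
    10: 4 * 11 ** 3,
    30: 5 * 31 ** 2 + 7 * 31,
    40: 3 * 41 ** 2,
    70: 71 ** 2,
    100: 50 * 101,
    150: 34 * 151,
}
sobol_n = 2 ** 12 + 2 ** 10

for d, n in faure_n.items():
    print(d, smallest_prime_at_least(d), n)
print("Sobol'", sobol_n)
```

Every value of n lies within a few percent of 5000, so the comparison across dimensions uses nearly the same amount of work.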


[Figure: vertical axis "RMS Relative Error"; legible curve labels are "Monte Carlo" and "Sobol’".]

Fig. 5.15. Root mean square relative error in pricing options on the geometric 
average of d assets, with d the dimension. 


The comparison in Figure 5.16 is similar but uses (5.34), so d now indexes 
the number of averaging dates along the path of a single asset. For this com- 
parison we fixed T at 0.25, we let K vary from 96 to 104 in increments of 2, 
and let $\sigma$ vary as before, to produce a total of fifty options. Each option price 
approaches a limit as d increases (the price associated with the continuous av- 
erage), and the Monte Carlo RMSE is nearly constant across dimensions. The 
errors using Faure points show a sharp increase at d = 100 and d = 150. The 
errors using Sobol’ points show a much less severe dependence on dimension. 
The number of points used for each of the three methods is the same in Figure 5.16 
as in Figure 5.15. 


[Figure: legible labels are "Monte Carlo" and "5134 Faure"; vertical-axis ticks include 0.2 and 0.25.]

Fig. 5.16. Root mean square error in pricing options on the geometric time-average 
of d values of a single asset, with d the dimension. 


Without experimentation, it is difficult to know how many QMC points 
to use to achieve a desired accuracy. In ordinary Monte Carlo, one can use a 
standard error estimated from a modest number of replications to determine 
the number of replications to undertake in a second stage of sampling to reach 
a target precision. Some authors have proposed stopping rules for QMC based 
on monitoring fluctuations in the approximation — rules that stop once the 
fluctuations are smaller than the required error tolerance. But such procedures 
are risky, as illustrated in Figure 5.17. The figure plots the running average 
of the estimated price of an option on the geometric average of 30 assets 
(with T = 0.25, $\sigma = 0.45$, and K = 100) using Faure points. An automatic 
stopping rule would likely detect convergence — erroneously — near 6000 
points or 13000 points where the average plateaus. But in both cases, the 
QMC approximation remains far from the true value, which is not crossed 
until after more than 19000 points. These results use Faure points from the 
start of the sequence (in base 31); skipping an initial portion of the sequence 
would reduce the severity of this problem but would not eliminate it. 
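
For contrast, the two-stage procedure available in ordinary Monte Carlo can be sketched as follows. The sketch assumes a normal-approximation confidence interval with quantile z = 1.96; the function name and the pilot data are hypothetical.

```python
import math
import statistics

def second_stage_sample_size(pilot, half_width, z=1.96):
    """Replications needed so that z * s / sqrt(n) <= half_width,

    where s is the sample standard deviation of the pilot replications.
    """
    s = statistics.stdev(pilot)
    return math.ceil((z * s / half_width) ** 2)

# Hypothetical pilot run of 100 replications taking the values 0 and 2
pilot = [0.0, 2.0] * 50
print(second_stage_sample_size(pilot, half_width=0.1))
```

No comparably simple rule exists for a deterministic QMC sequence, because a plateau in the running average carries no statistical information about the remaining error.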

Fig. 5.17. Cumulative average approximation to a 30-dimensional option price 
using Faure points. The left panel magnifies the inset in the right panel. The ap- 
proximation approaches the true value through plateaus that create the appearance 
of convergence. 


Next we compare randomized QMC point sets using a random shift modulo 
1 as in (5.32). For this comparison we consider a single option — a call on 
the geometric average of five assets, with T = 0.25, K = 100, and $\sigma = 0.45$. 
Because of the randomization, we can now compare methods based on their 
variances; these are displayed in Table 5.4. To compensate for differences in 
the cardinalities of the point sets, we report a product $n\sigma^2$, where n is the 
number of points in the set and $\sigma^2$ is the variance of the average value of the 
integrand over a randomly shifted copy of the point set. This measure makes 
the performance of ordinary Monte Carlo independent of the choice of n. 
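
The product reported in the table can be estimated from independent random shifts, as in the following sketch. It assumes a rank-1 lattice as the underlying point set and estimates the variance by the sample variance over replications; the function names, the number of replications, and the test integrand are illustrative.

```python
import random
import statistics

def lattice(n, v):
    """Unshifted rank-1 lattice points k v / n (mod 1), k = 0, ..., n-1."""
    return [[(k * vj / n) % 1.0 for vj in v] for k in range(n)]

def shifted_average(points, f, rng):
    """Average of f over one randomly shifted copy of the point set."""
    shift = [rng.random() for _ in points[0]]
    total = sum(f([(x + s) % 1.0 for x, s in zip(p, shift)]) for p in points)
    return total / len(points)

def n_times_variance(points, f, replications, rng):
    """Estimate n * Var(shifted average) and the mean over replications."""
    averages = [shifted_average(points, f, rng) for _ in range(replications)]
    return len(points) * statistics.variance(averages), statistics.fmean(averages)

rng = random.Random(7)                        # fixed seed for reproducibility
pts = lattice(1069, (1, 63))
nvar, mean = n_times_variance(pts, lambda u: u[0] + u[1], 20, rng)
print(nvar, mean)                             # the exact integral is 1
```

For ordinary Monte Carlo the same product equals the variance of a single replication, which is why it does not depend on n.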

For the Faure and Sobol’ results, we generated each point set of size n by 
starting at the nth point in the sequence; each n is a power of the correspond- 
ing base. The lattice rules are the same as those used for Figure 5.14. For 
the Korobov rules we display the number of points and the multiplier a; these 
values are from L’Ecuyer and Lemieux [226]. 

All the QMC methods show far smaller variance than ordinary Monte 
Carlo. Sobol’ points generally appear to produce the smallest variance, but 
the smallest variance overall corresponds to a lattice rule. The Korobov rules 
have larger variances than the other methods. 

Lattice Korobov Faure Sobol’ Monte Carlo 

Table 5.4. Variance comparison for randomly shifted QMC methods and Monte 
Carlo. 

The numerical examples considered here suggest some general patterns: 
the QMC methods produce substantially more precise values than ordinary 
Monte Carlo; this holds even at rather small values of n, before $O(1/n^{1-\epsilon})$ 
convergence is evident; Sobol’ points generally produce smaller errors than 
Faure points or lattice rules; the advantages of QMC persist even in rather 
high dimensions, especially for Sobol’ points; randomized QMC point sets 
produce low-variance estimates. 

The effectiveness of QMC methods in high-dimensional pricing problems 
runs counter to the traditional view that these methods are unsuitable in 
high dimensions. The traditional view is rooted in the convergence rate of 
$O((\log n)^d/n)$: if d is large then n must be very large for the denominator 
to overwhelm the numerator. The explanation for this apparent contradic- 
tion may lie in the structure of problems arising in finance — these high- 
dimensional integrals might be well-approximated by much lower-dimensional 
integrals, a possibility we exploit in Section 5.5.2. 

Extrapolating from a limited set of examples (we have considered just one 
type of option and just one model of asset price dynamics) is risky, so we 
comment on results from other investigations. Acworth et al. [4] and Boyle 
et al. [53] find that Sobol’ points outperform Faure points and that both 
outperform ordinary Monte Carlo in comparisons similar to those reported 
here. Morland [270] reports getting better results with Sobol’ points than 
Niederreiter points (of the type generated in Bratley, Fox, and Niederreiter 
[58]). Joy et al. [204] test Faure sequences on several different types of options, 
including an HJM swaption pricing application, and find that they work well. 
Berman [45] compares methods on a broad range of options and models; he 
finds that Sobol’ points give more precise results than ordinary Monte Carlo, 
but he also finds that with some simple variance reduction techniques the 
two methods perform very similarly. Paskov [295], Paskov and Traub [296], 
and Caflisch et al. [73] find that Sobol’ points work well in pricing mortgage- 
backed securities formulated as 360-dimensional integrals. Papageorgiou and 
Traub [293] report improved results using a generalized Faure sequence and 
Ninomiya and Tezuka [283] report superior results for similar problems using a 
generalized Niederreiter sequence, but neither specifies the exact construction 
used. 

For the most part, these comparisons (like those presented here) pit QMC 
methods against only the simplest form of Monte Carlo. Variance reduction 
techniques can of course improve the precision of Monte Carlo estimates; they 
provide a mechanism for taking advantage of special features of a model to 
a much greater extent than QMC. Indeed, the “black-box” nature of QMC 
methods is part of their appeal. As discussed in Section 5.1.3, the ready avail- 
ability of error information through confidence intervals is an advantage of 
Monte Carlo methods. 

For calculations that need to be repeated often with only minor changes 
in parameters — for example, options that need to be priced every day — 
this suggests the following approach: tailor a Monte Carlo method to the 
specific problem, using estimates of standard errors to compare algorithms, 
and determine the required sample size; once the problem and its solution are 
well understood, replace the random number generator with a quasi-Monte 
Carlo generator. 


5.5.2 Strategic Implementation 


QMC methods have the potential to improve accuracy for a wide range of 
integration problems without requiring an integrand-specific analysis. There 
are, however, two ways in which the application of QMC methods can be 
tailored to a specific problem to improve performance: 


(i) changing the order in which coordinates of a sequence are assigned to 
    arguments of an integrand; 
(ii) applying a change of variables to produce a more tractable integrand. 


The first of these transformations is actually a special case of the second but 
it merits separate consideration. 

The strategy in (i) is relevant when some coordinates of a low-discrepancy 
sequence exhibit better uniformity properties than others. This holds for Hal- 
ton sequences (in which coordinates with lower bases are preferable) and for 
Sobol’ sequences (in which coordinates generated by lower-degree polynomials 
are preferable), but not for Faure sequences. As explained in Section 5.2.2, 
all coordinates of a Faure sequence are equally well distributed. But the more 
general strategy in (ii) is potentially applicable to all QMC methods. 

As a simple illustration of (ii), consider the function on $[0,1)^5$ defined by 

$$f(u_1, u_2, u_3, u_4, u_5) = \mathbf{1}\{|\Phi^{-1}(u_4)| + |\Phi^{-1}(u_5)| < 2\sqrt{2}\},$$

with $\Phi^{-1}$ the inverse cumulative normal distribution. Although this reduces to 
a bivariate integrand, we have formulated it as a five-dimensional problem for 
purposes of illustration. The integral of this function is the probability that 
a pair of independent standard normal random variables fall in the square 
in $\mathbb{R}^2$ with vertices $(0, \pm 2\sqrt{2})$ and $(\pm 2\sqrt{2}, 0)$, which is approximately 0.9111. 
Applying an orthogonal transformation to a pair of independent standard nor- 
mal random variables produces another pair of independent standard normal 
random variables, so the same probability applies to the rotated square with 
vertices $(\pm 2, \pm 2)$. Thus, a change of variables transforms the integrand above 
to 

$$\tilde{f}(u_1, u_2, u_3, u_4, u_5) = \mathbf{1}\{\max(|\Phi^{-1}(u_4)|, |\Phi^{-1}(u_5)|) < 2\}.$$

Figure 5.18 compares the convergence of QMC approximations to $f$ (the dot- 
ted line) and $\tilde{f}$ (the solid line) using a five-dimensional Faure sequence starting 
at the 625th point. In this example, the rotation has an evident impact on 
the quality of the approximation: after 3125 points, the integration error for 
$f$ is nearly four times as large as the error for $\tilde{f}$. That a rotation could affect 
convergence in this way is not surprising in view of the orientation displayed 
by Faure sequences, as in, e.g., Figure 5.3. 
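
The two integrands and their common integral are simple to write down in code. The sketch below uses the standard library's NormalDist for the normal distribution function and its inverse; the function names are illustrative, and the arguments must lie strictly between 0 and 1 because inv_cdf rejects 0 and 1.

```python
import math
from statistics import NormalDist

INV_CDF = NormalDist().inv_cdf   # inverse cumulative normal, defined on (0, 1)

def f(u):
    """Indicator of the diamond |z4| + |z5| < 2*sqrt(2); u has five coordinates."""
    z4, z5 = INV_CDF(u[3]), INV_CDF(u[4])
    return 1.0 if abs(z4) + abs(z5) < 2 * math.sqrt(2) else 0.0

def f_rotated(u):
    """Indicator of the axis-aligned square max(|z4|, |z5|) < 2."""
    z4, z5 = INV_CDF(u[3]), INV_CDF(u[4])
    return 1.0 if max(abs(z4), abs(z5)) < 2 else 0.0

# Exact value of both integrals: P(|Z| < 2)^2 for a standard normal Z
exact = (2 * NormalDist().cdf(2.0) - 1) ** 2
print(exact)
```

The two functions differ pointwise (a point near the corner of the square lies outside the diamond) but have the same integral, so any difference in QMC error comes from how the point set interacts with the orientation of the region.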

Fig. 5.18. Both squares on the left have probability 0.9111 under the bivariate 
standard normal distribution. The right panel shows the convergence of QMC ap- 
proximations for the probabilities of the two squares. The solid horizontal line shows 
the exact value. 


Assigning Coordinates 


We proceed with an illustration of strategy (i) in which the form of the integrand 
is changed only through a permutation of its arguments. In the examples 
we considered in Section 5.5.1, the integrands are symmetric functions of their 
arguments because we took the underlying assets to be identical in (5.33) and 
we took the averaging dates to be equally spaced in (5.34). Changing the 
assignment of coordinates to variables would therefore have no effect on the 
value of a QMC approximation. 

To break the symmetry of the multi-asset option in Section 5.5.1, we assign 
linearly increasing volatilities $\sigma_i = i\sigma_1$, $i = 1,\ldots,d$, to the d assets. We take 
the volatility of the ith asset as a rough measure of the importance of the ith 
coordinate (and continue to assume the assets are uncorrelated). With this 
interpretation, assigning the coordinates of a Sobol’ sequence to the assets in 
reverse order should produce better results than assigning the ith coordinate 
to the ith asset. 

To test this idea we take d = 30; the degrees of the primitive polynomials 
generating the coordinates then increase from 0 to 8. We compare straightfor- 
ward application of Sobol’ points with a reversed assignment of coordinates 
based on root mean square relative error. Given an average level of volatility 
$\bar{\sigma}$, we choose $\sigma_1$ so that $\bar{\sigma}^2 = (\sigma_1^2 + \cdots + \sigma_d^2)/d$, with $\sigma_i = i\sigma_1$. We let $\bar{\sigma}$ range 
from 0.21 to 0.66 in increments of 0.05, and we let K vary from 94 to 102 in 
increments of 2 to get the same fifty option values we used for Figure 5.15. 
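
The volatilities implied by a given average level follow from the identity for the sum of the first d squares. The short sketch below makes the calculation explicit; the function name is illustrative.

```python
import math

def linear_volatilities(sigma_bar, d):
    """Volatilities sigma_i = i * sigma_1 whose mean square equals sigma_bar^2.

    Uses 1^2 + ... + d^2 = d(d+1)(2d+1)/6, so that
    sigma_bar^2 = sigma_1^2 * (d+1)(2d+1)/6.
    """
    sigma_1 = sigma_bar / math.sqrt((d + 1) * (2 * d + 1) / 6)
    return [i * sigma_1 for i in range(1, d + 1)]

sigmas = linear_volatilities(0.21, 30)
print(sigmas[0], sigmas[-1])   # smallest and largest of the 30 volatilities
```

Under the reversed assignment, the asset with volatility `sigmas[-1]` would receive the first Sobol’ coordinate.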

Table 5.5 displays the resulting RMS relative errors. The first row shows 
the number of points n. We specifically avoided values of n equal to powers 
of 2 in order to further differentiate the coordinates of the sequence; this 
makes the convergence of both methods erratic. In this example the reversed 
assignment usually produces smaller errors, but not always. 


n          750    1500   2500   3500   5000   7500   10000  12000
Sobol’     0.023  0.012  0.017  0.021  0.013  0.012  0.007  0.005
Reverse    0.020  0.021  0.010  0.015  0.009  0.007  0.005  0.003


Table 5.5. RMS relative errors for options on the geometric average of 30 assets 
with linearly increasing volatilities. Top row gives the number of points. Second row is 
based on assigning ith coordinate to ith asset; last row uses reversed assignment. 


Changing Variables 


A general strategy for improving QMC approximations applies a change of 
variables to produce an integrand for which only a small number of arguments 
are “important” and then applies the lowest-indexed coordinates of a QMC 
sequence to those coordinates. Finding an effective transformation presents 
essentially the same challenge as finding good stratification variables, a topic 
treated in Section 4.3.2. As is the case in stratified sampling, the Gaussian 
setting offers particular flexibility. 

In the application of QMC to derivatives pricing, the integrand f subsumes 
the dynamics of underlying assets as well as the form of the derivative contract. 
In the absence of specific information about the payoff of a derivative, one 
might consider transformations tied to the asset dynamics. 

A simple yet effective example of this idea is the combination of Sobol’ 
sequences with the Brownian bridge construction of Brownian motion devel- 
oped in Section 3.1. In a straightforward application of Sobol’ points to the 
generation of Brownian paths, the ith coordinate of each point would be trans- 
formed to a sample from the standard normal distribution (using $\Phi^{-1}$), and 
these would be scaled and summed using the random walk construction (3.2). 
To the extent that the initial coordinates of a Sobol’ sequence have uniformity 
superior to that of higher-indexed coordinates, this construction does a par- 
ticularly good job of sampling the first few increments of the Brownian path. 
However, many option contracts would be primarily sensitive to the terminal 
value of the Brownian path. 

Through the Brownian bridge construction, the first coordinate of a Sobol’ 
sequence determines the terminal value of the Brownian path, so this value 
should be particularly well distributed. Moreover, the first several coordinates 
of the Sobol’ sequence determine the general shape of the Brownian path; the 
last few coordinates influence only the fine detail of the path, which is often 
less important. This combination of Sobol’ points with the Brownian bridge 
construction was proposed by Moskowitz and Caflisch [273] and has been 
found by several authors (including Acworth et al. [4], Akesson and Lehoczky 
[9], and Caflisch et al. [73]) to be highly effective in finance applications. 
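
The difference between the two constructions can be seen in a small sketch. It assumes d equally spaced dates with d a power of two, and it takes as input a list z of standard normals (for example, obtained by applying the inverse normal distribution to the coordinates of a QMC point); the function names are illustrative and this is not the general algorithm of Section 3.1.

```python
import math

def random_walk_path(z, T):
    """W at times T/d, 2T/d, ..., T from increments sqrt(T/d) * z_i."""
    dt = T / len(z)
    w = [0.0]
    for zi in z:
        w.append(w[-1] + math.sqrt(dt) * zi)
    return w

def brownian_bridge_path(z, T):
    """Same grid, but z[0] sets W(T) and later normals fill in midpoints."""
    d = len(z)
    if d < 1 or d & (d - 1):
        raise ValueError("this sketch assumes d is a power of two")
    dt = T / d
    w = [0.0] * (d + 1)
    w[d] = math.sqrt(T) * z[0]          # terminal value uses the first normal
    j, h = 1, d
    while h > 1:
        half = h // 2
        for left in range(0, d, h):
            right, mid = left + h, left + half
            # Given both endpoints, the midpoint has mean equal to their
            # average and variance (h * dt) / 4.
            w[mid] = 0.5 * (w[left] + w[right]) + math.sqrt(h * dt / 4) * z[j]
            j += 1
        h = half
    return w

print(brownian_bridge_path([1.0, 0.0, 0.0, 0.0], 1.0))
print(random_walk_path([1.0, 0.0, 0.0, 0.0], 1.0))
```

In the bridge construction, changing z[0] moves the whole path, whereas in the random walk construction it changes only the first increment.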

As discussed in Section 3.1, the principal components construction of a 
discrete Brownian path (or any other Gaussian vector) has an optimality 
property that maximizes the importance (in the statistical sense of explained 
variance) of any initial number of independent normals used to construct the 
vector. Though this property lacks a precise relation to discrepancy, it suggests 
a construction in which the ith coordinate of a Sobol’ sequence is assigned 
to the ith principal component. Unlike the Brownian bridge construction, the 
principal components construction is applicable with any covariance matrix. 

This construction was proposed and tested in Acworth et al. [4]. Tables 5.6 
and 5.7 show some of their results. The tables report RMS relative errors com- 
paring an ordinary application of Sobol’ sequences with the Brownian bridge 
and principal components constructions. The errors are computed over 250 
randomly generated problem instances as described in [4]. Table 5.6 reports 
results for barrier options and geometric average options on a single underlying 
asset. The results indicate that both the Brownian bridge (BB) and princi- 
pal components (PC) constructions can produce substantial error reductions 
compared to straightforward application of Sobol’ points in a random walk 
construction. This is particularly evident at smaller values of n. 

Table 5.7 shows results for options on the geometric average of d assets. 
The Brownian bridge construction is inapplicable in this setting, so only an 
ordinary application of Sobol’ points (using Cholesky factorization) and the 
principal components construction appear in the table. These methods are 
compared for uncorrelated assets and assets for which all correlations are 
0.3. In the case of uncorrelated assets, the principal components construction 
simply permutes the coordinates of the Sobol’ sequence, assigning the ith 
coordinate to the asset with the ith largest volatility. This suggests that the 
differences between the two methods should be greater in the correlated case, 
and this is borne out by the results in the table. 

                         Barrier Options          Average Options
                      Sobol’    BB     PC      Sobol’    BB     PC
d = 10,  n =  1,250    1.32    0.78   0.97      2.14    0.71   0.32
              5,000    0.75    0.41   0.49      0.18    0.24   0.11
             20,000    0.48    0.53   0.50      0.08    0.08   0.02
             80,000    0.47    0.47   0.47      0.03    0.03   0.01
d = 50,  n =  1,250    7.10    1.14   1.18      4.24    0.53   0.33
              5,000    1.10    0.87   0.59      0.61    0.16   0.11
             20,000    0.30    0.25   0.31      0.24    0.05   0.02
             80,000    0.22    0.12   0.08      0.06    0.03   0.01
d = 100, n =  1,250    9.83    1.32   1.41     10.12    0.63   0.33
              5,000    1.70    0.91   0.46      1.27    0.18   0.11
             20,000    0.62    0.23   0.28      0.24    0.04   0.02
             80,000    0.19    0.09   0.11      0.05    0.03   0.01


Table 5.6. RMS relative errors (in percent) for single-asset options with d steps per 
path and n paths, using three different constructions of the underlying Brownian 
paths. 

Neither the Brownian bridge nor the principal components construction 
is tailored to a particular type of option payoff. Given additional information 
about a payoff, we could try to find still better changes of variables. As an 
example, consider again an option on the geometric mean of d uncorrelated 
assets. A standard simulation would map a point $(u_1,\ldots,u_d) \in [0,1)^d$ to a 
value of the average $\bar{S}$ in (5.33) by first mapping each $u_i$ to $\Phi^{-1}(u_i)$ and then 
setting 

$$\bar{S} = \left( \prod_{i=1}^{d} S_i(0) \right)^{1/d} \exp\left( rT - \frac{T}{2d} \sum_{i=1}^{d} \sigma_i^2 + \frac{\sqrt{T}}{d} \sum_{i=1}^{d} \sigma_i \Phi^{-1}(u_i) \right).$$

Because the last sum in the exponent is normally distributed with mean 0 and 
variance $\sum_{i=1}^{d} \sigma_i^2$, it can be replaced by $\sqrt{\sum_{i=1}^{d} \sigma_i^2}\,\Phi^{-1}(u_1)$ without changing 
the distribution of $\bar{S}$. 
This reduces the problem to a one-dimensional integral and uses the first 
coordinate $u_1$ for that integration. This example is certainly not typical, but it 
illustrates the flexibility available to change variables, particularly for models 
driven by normal random variables. All of the examples of stratified sampling 
in Section 4.3.2 can similarly be applied as changes of variables for QMC 
methods. Further strategies for improving the accuracy of QMC methods are 
developed in Fox [127]. 
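
The reduction to one dimension for the geometric mean of uncorrelated assets can be illustrated with a short sketch. It assumes that every asset starts at the same value s0 and has drift r, and it integrates over the single coordinate with a midpoint rule on equally spaced points, which stands in for the first coordinate of a QMC sequence; the function names are illustrative.

```python
import math
from statistics import NormalDist

INV_CDF = NormalDist().inv_cdf   # inverse cumulative normal on (0, 1)

def average_from_all_coordinates(u, s0, sigmas, r, T):
    """Standard mapping: one coordinate u_i per asset."""
    d = len(sigmas)
    noise = sum(s * INV_CDF(ui) for s, ui in zip(sigmas, u))
    drift = r * T - T * sum(s * s for s in sigmas) / (2 * d)
    return s0 * math.exp(drift + math.sqrt(T) * noise / d)

def average_from_first_coordinate(u1, s0, sigmas, r, T):
    """Reduced mapping: the weighted sum of normals is replaced by one normal
    with the same variance, driven by the single coordinate u1."""
    d = len(sigmas)
    total = sum(s * s for s in sigmas)
    drift = r * T - T * total / (2 * d)
    return s0 * math.exp(drift + math.sqrt(T * total) * INV_CDF(u1) / d)

def one_dimensional_price(n, strike, s0, sigmas, r, T):
    """Midpoint rule over u1 for the discounted payoff (S_bar - K)^+."""
    total = 0.0
    for k in range(n):
        s_bar = average_from_first_coordinate((k + 0.5) / n, s0, sigmas, r, T)
        total += max(s_bar - strike, 0.0)
    return math.exp(-r * T) * total / n

print(one_dimensional_price(20000, 100.0, 100.0, [0.2, 0.3, 0.4], 0.05, 0.25))
```

Both mappings give the average the same distribution; the second concentrates all of the variability in a single, well-distributed coordinate.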


5.6 Concluding Remarks 


The preponderance of the experimental evidence amassed to date points to 
Sobol’ sequences as the most effective quasi-Monte Carlo method for appli- 
cations in financial engineering. They often produce more accurate results 
than other QMC and Monte Carlo methods, and they can be generated very 
quickly through the algorithms of Bratley and Fox [57] and Press et al. [299]. 

                       Correlation 0      Correlation 0.3
                      Sobol’    PC       Sobol’    PC
d = 10,  n =  1,250    1.20    1.01       1.03    0.23
              5,000    0.37    0.50       0.17    0.06
             20,000    0.19    0.20       0.06    0.02
             80,000    0.06    0.03       0.04    0.01
d = 50,  n =  1,250    3.05    2.45       1.58    0.16
              5,000    0.50    0.34       0.21    0.05
             20,000    0.18    0.08       0.05    0.02
             80,000    0.08    0.04       0.04    0.01
d = 100, n =  1,250    3.18    3.59       2.15    0.16
              5,000    0.53    0.56       0.34    0.04
             20,000    0.13    0.10       0.06    0.02
             80,000    0.07    0.02       0.03    0.00


Table 5.7. RMS relative errors (in percent) for options on the geometric average 
of d assets using n paths. 


Although QMC methods are based on a deterministic perspective, the 
performance of Sobol’ sequences in derivatives pricing can often be improved 
through examination of the underlying stochastic model. Because the initial 
coordinates of a Sobol’ sequence are more uniform than later coordinates, 
a strategic assignment of coordinates to sources of randomness can improve 
accuracy. The combination of Sobol’ points with the Brownian bridge construction 
is an important example of this idea, but by no means the only 
one. The applications of stratified sampling in Section 4.3.2 provide further 
examples, because good directions for stratification are also good candidates 
for effective use of the best Sobol’ coordinates. 

One might consider applying methods from Chapter 4 — a control variate, 
for example — in a QMC numerical integration. We prefer to take such combi- 
nations in the opposite order: first analyze a stochastic problem stochastically 
and use this investigation to find an effective variance reduction technique; 
then reformulate the variance-reduced simulation problem as an integration 
problem to apply QMC. Thus, one might develop an importance sampling 
technique and then implement it using QMC. It would be much more difficult 
to derive effective importance sampling methods of the type illustrated 
in Section 4.6 starting from a QMC integration problem. 

Indeed, we view postponing the integration perspective as a good way to 
apply QMC techniques to stochastic problems more generally. The transfor- 
mation to a Brownian bridge construction, for example, is easy to understand 
from a stochastic perspective but would be opaque if viewed as a change of 
variables for an integration problem. Also, the simple error estimates provided 
by Monte Carlo simulation are especially useful in developing and comparing 
algorithms. After finding a satisfactory algorithm one may apply QMC to try 
to further improve accuracy. This is particularly useful if similar problems 
need to be solved repeatedly, as is often the case in pricing applications. Randomized 
QMC methods make it possible to compute simple error estimates 
for QMC calculations and can sometimes reduce errors too. 

The effectiveness of QMC methods in high-dimensional pricing problems 
cannot be explained by comparing the $O(1/\sqrt{n})$ convergence of Monte Carlo 
with the $O(1/n^{1-\epsilon})$ convergence of QMC because of the $(\log n)^d$ factor 
subsumed by the $\epsilon$. An important part of the explanation must be that the main 
source of dimensionality in most finance problems is the number of time steps, 
and as the Brownian bridge and principal components constructions indicate, 
this may artificially inflate the nominal dimension. Recent work has identified 
abstract classes of integration problems for which QMC is provably effective in 
high dimensions because of the diminishing importance of higher dimensions; 
see Sloan and Wozniakowski [334] for a detailed analysis, Sloan [332] for an 
overview, and Larcher, Leobacher, and Scheicher [220] for an application of 
these ideas to the Brownian bridge construction. Owen [291] argues that the 
key requirement for the effectiveness of QMC in high dimensions is that the 
integrand be well-approximated by a sum of functions depending on a small 
number of variables each. 

6 Discretization Methods 


This chapter presents methods for reducing discretization error — the bias 
in Monte Carlo estimates that results from time-discretization of stochastic 
differential equations. Chapter 3 gives examples of continuous-time stochas- 
tic processes that can be simulated exactly at a finite set of dates, meaning 
that the joint distribution of the simulated values coincides with that of the 
continuous-time model at the simulated dates. But these examples are excep- 
tional and most models arising in derivatives pricing can be simulated only 
approximately. The simplest approximation is the Euler scheme; this method 
is easy to implement and almost universally applicable, but it is not always 
sufficiently accurate. This chapter discusses methods for improving the Euler 
scheme and, as a prerequisite for this, discusses criteria for comparing dis- 
cretization methods. 

The issues addressed in this chapter are orthogonal to those in Chap- 
ters 4 and 5. Once a time-discretization method is fixed, applying a variance 
reduction technique or quasi-Monte Carlo method may improve precision in 
estimating an expectation at the fixed level of discretization, but it can do 
nothing to reduce discretization bias. 


6.1 Introduction 


We begin by discussing properties of the Euler scheme, the simplest method 
for approximate simulation of stochastic differential equations. We then 
undertake an expansion to refine the Euler scheme and present criteria for 
comparing methods. 


6.1.1 The Euler Scheme and a First Refinement 


We consider processes X satisfying a stochastic differential equation (SDE) 
of the form 

$$dX(t) = a(X(t))\,dt + b(X(t))\,dW(t), \tag{6.1}$$

usually with X(0) fixed. In the most general setting we consider, X takes 
values in $\mathbb{R}^d$ and W is an m-dimensional standard Brownian motion, in which 
case a takes values in $\mathbb{R}^d$ and b takes values in $\mathbb{R}^{d\times m}$. Some of the methods in 
this chapter are most easily introduced in the simpler case of scalar X and W. 
The coefficient functions a and b are assumed to satisfy the conditions in 
Appendix B.2 for existence and uniqueness of a strong solution to the SDE (6.1); 
indeed, we will need to impose stronger conditions to reduce discretization 
error. 

We use $\hat X$ to denote a time-discretized approximation to X. The Euler (or 
Euler-Maruyama, after [254]) approximation on a time grid $0 = t_0 < t_1 < \cdots < t_m$ 
is defined by $\hat X(0) = X(0)$ and, for $i = 0, \ldots, m-1$, 

$$\hat X(t_{i+1}) = \hat X(t_i) + a(\hat X(t_i))[t_{i+1} - t_i] + b(\hat X(t_i))\sqrt{t_{i+1} - t_i}\,Z_{i+1},$$

with $Z_1, Z_2, \ldots$ independent, m-dimensional standard normal random vectors. 
To lighten notation, we restrict attention to a grid with a fixed spacing h, 
meaning that $t_i = ih$. Everything we discuss carries over to the more general 
case provided the largest of the increments $t_{i+1} - t_i$ decreases to zero. Adaptive 
methods, in which the time steps depend on the evolution of $\hat X$ and are thus 
stochastic, require separate treatment; see, for example, Gaines and Lyons 
[133]. 

With a fixed time step h > 0, we may write $\hat X(ih)$ as $\hat X(i)$ and write the 
Euler scheme as 

$$\hat X(i+1) = \hat X(i) + a(\hat X(i))h + b(\hat X(i))\sqrt{h}\,Z_{i+1}. \tag{6.2}$$


Implementation of this method is straightforward, at least if a and b are easy 
to evaluate. Can we do better? And in what sense is one approximation better 
than another? These are the questions we address. 
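
As a point of reference for what follows, the recursion (6.2) can be sketched in a few lines for scalar X and W. This is only an illustrative sketch: the function name `euler_path`, the use of Python's standard `random` module, and the geometric Brownian motion coefficients in the usage lines are our assumptions and do not come from the text.

```python
import math
import random


def euler_path(a, b, x0, h, n_steps, rng):
    """Simulate one Euler path of dX = a(X) dt + b(X) dW on the grid 0, h, 2h, ..."""
    path = [x0]
    x = x0
    sqrt_h = math.sqrt(h)
    for _ in range(n_steps):
        z = rng.gauss(0.0, 1.0)               # Z_{i+1}, standard normal
        x = x + a(x) * h + b(x) * sqrt_h * z  # one step of (6.2)
        path.append(x)
    return path


# Illustrative coefficients (an assumption, not from the text):
# geometric Brownian motion with drift mu and volatility sigma.
mu, sigma = 0.05, 0.2
rng = random.Random(12345)
path = euler_path(lambda x: mu * x, lambda x: sigma * x, 100.0, 0.01, 100, rng)
```

Passing the coefficient functions as arguments keeps the sketch applicable to any scalar SDE of the form (6.1) for which a and b are easy to evaluate.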

In the numerical solution of ordinary differential equations, methods of 
higher-order accuracy often rely on Taylor expansions. If b were identically 
zero (and thus (6.1) non-stochastic), (6.2) would reduce to a linear 
approximation, and a natural strategy for improving accuracy would include 
higher-order terms in a Taylor expansion of a(X(t)). A similar strategy applies to 
stochastic differential equations, but it must be carried out consistent with 
the rules of Itô calculus rather than ordinary calculus. 


A First Refinement 


Inspection of the Euler scheme (6.2) from the perspective of Taylor expansion 
suggests a possible inconsistency: this approximation expands the drift to O(h) 
but the diffusion term only to $O(\sqrt{h})$. The approximation to the diffusion term 
omits O(h) contributions, so including a term of order h in the drift looks like 
spurious accuracy. This discrepancy also suggests that to refine the Euler 
scheme we may want to focus on the diffusion term. 

We now carry out this proposal. We will see, however, that whether or not 
it produces an improvement compared to the Euler scheme depends on how 
we measure error. 
We start with the scalar case d = m = 1. Recall that the SDE (6.1) 
abbreviates the relation 

$$X(t) = X(0) + \int_0^t a(X(u))\,du + \int_0^t b(X(u))\,dW(u). \tag{6.3}$$

The Euler scheme results from the approximations 

$$\int_t^{t+h} a(X(u))\,du \approx a(X(t))h \tag{6.4}$$

and 

$$\int_t^{t+h} b(X(u))\,dW(u) \approx b(X(t))[W(t+h) - W(t)]. \tag{6.5}$$

In both cases, an integrand over [t, t+h] is approximated by its value at t. To 
improve the approximation of the diffusion term, we need a better 
approximation of b(X(u)) over an interval [t, t+h]. We therefore examine the 
evolution of b(X(u)). 


From Itô's formula we get 

$$\begin{aligned}
db(X(t)) &= b'(X(t))\,dX(t) + \tfrac{1}{2}b''(X(t))b^2(X(t))\,dt \\
&= \left[b'(X(t))a(X(t)) + \tfrac{1}{2}b''(X(t))b^2(X(t))\right]dt + b'(X(t))b(X(t))\,dW(t) \\
&\equiv \mu_b(X(t))\,dt + \sigma_b(X(t))\,dW(t),
\end{aligned}$$

where b' and b'' are the first and second derivatives of b. Applying the Euler 
approximation to the process b(X(t)) results in the approximation of b(X(u)), 
$t \le u \le t+h$, by 

$$\begin{aligned}
b(X(u)) &\approx b(X(t)) + \mu_b(X(t))[u - t] + \sigma_b(X(t))[W(u) - W(t)] \\
&= b(X(t)) + \left(b'(X(t))a(X(t)) + \tfrac{1}{2}b''(X(t))b^2(X(t))\right)[u - t] \\
&\quad + b'(X(t))b(X(t))[W(u) - W(t)].
\end{aligned}$$

Now W(u) - W(t) is $O(\sqrt{u-t})$ (in probability) whereas the drift term in this 
approximation is O(u - t) and thus of higher order. Dropping this higher-order 
term yields the simpler approximation 

$$b(X(u)) \approx b(X(t)) + b'(X(t))b(X(t))[W(u) - W(t)], \quad u \in [t, t+h]. \tag{6.6}$$


Armed with this approximation, we return to the problem of refining (6.5). 
Instead of freezing b(X(u)) at b(X(t)) over the interval [t, t+h], as in (6.5), 
we use the approximation (6.6). Thus, we replace (6.5) with 

$$\begin{aligned}
\int_t^{t+h} b(X(u))\,dW(u)
&\approx \int_t^{t+h} \left(b(X(t)) + b'(X(t))b(X(t))[W(u) - W(t)]\right)dW(u) \\
&= b(X(t))[W(t+h) - W(t)] + b'(X(t))b(X(t))\left(\int_t^{t+h}[W(u) - W(t)]\,dW(u)\right).
\end{aligned} \tag{6.7}$$

The proposed refinement uses this expression in place of $b(\hat X(i))\sqrt{h}\,Z_{i+1}$ in 
the Euler scheme (6.2). 
To make this practical, we need to simplify the remaining integral in (6.7). 
We can write this integral as 

$$\begin{aligned}
\int_t^{t+h}[W(u) - W(t)]\,dW(u)
&= \int_t^{t+h} W(u)\,dW(u) - W(t)\int_t^{t+h} dW(u) \\
&= Y(t+h) - Y(t) - W(t)[W(t+h) - W(t)]
\end{aligned} \tag{6.8}$$

with 

$$Y(t) = \int_0^t W(u)\,dW(u);$$

i.e., Y(0) = 0 and 

$$dY(t) = W(t)\,dW(t).$$

Itô's formula verifies that the solution to this SDE is 

$$Y(t) = \tfrac{1}{2}W(t)^2 - \tfrac{1}{2}t.$$

Making this substitution in (6.8) and simplifying, we get 

$$\int_t^{t+h}[W(u) - W(t)]\,dW(u) = \tfrac{1}{2}[W(t+h) - W(t)]^2 - \tfrac{1}{2}h. \tag{6.9}$$

Using this identity in (6.7), we get 

$$\int_t^{t+h} b(X(u))\,dW(u) \approx b(X(t))[W(t+h) - W(t)]
+ \tfrac{1}{2}b'(X(t))b(X(t))\left([W(t+h) - W(t)]^2 - h\right).$$


Finally, we use this approximation to approximate X(t+h). We refine the 
one-step Euler approximation 

$$X(t+h) \approx X(t) + a(X(t))h + b(X(t))[W(t+h) - W(t)]$$

to 

$$X(t+h) \approx X(t) + a(X(t))h + b(X(t))[W(t+h) - W(t)]
+ \tfrac{1}{2}b'(X(t))b(X(t))\left([W(t+h) - W(t)]^2 - h\right).$$

In a simulation algorithm, we apply this recursively at h, 2h, ..., replacing the 
increments of W with $\sqrt{h}\,Z_{i+1}$; more explicitly, we have 

$$\hat X(i+1) = \hat X(i) + a(\hat X(i))h + b(\hat X(i))\sqrt{h}\,Z_{i+1}
+ \tfrac{1}{2}b'(\hat X(i))b(\hat X(i))h(Z_{i+1}^2 - 1). \tag{6.10}$$

This algorithm was derived by Milstein [266] through an analysis of partial 
differential equations associated with the diffusion X. It is sometimes called 
the Milstein scheme, but this terminology is ambiguous because there are 
several important methods due to Milstein. 
The approximation method in (6.10) adds a term to the Euler scheme. It 
expands both the drift and diffusion terms to O(h). Observe that, conditional 
on $\hat X(i)$, the new term 

$$\tfrac{1}{2}b'(\hat X(i))b(\hat X(i))h(Z_{i+1}^2 - 1)$$

has mean zero and is uncorrelated with the Euler terms because $Z_{i+1}^2 - 1$ 
and $Z_{i+1}$ are uncorrelated. The question remains, however, whether and in 
what sense (6.10) is an improvement over the Euler scheme. We address this 
in Section 6.1.2, after discussing the case of vector-valued X and W. 
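
For concreteness, the scalar recursion (6.10) might be coded as follows. This is a minimal sketch rather than a definitive implementation: the function name `milstein_path` and its argument names are hypothetical, and the caller is assumed to supply the derivative b' explicitly as `b_prime`.

```python
import math
import random


def milstein_path(a, b, b_prime, x0, h, n_steps, rng):
    """Simulate one path of the refined scheme (6.10) for scalar X and W."""
    path = [x0]
    x = x0
    sqrt_h = math.sqrt(h)
    for _ in range(n_steps):
        z = rng.gauss(0.0, 1.0)  # Z_{i+1}, standard normal
        x = (x + a(x) * h + b(x) * sqrt_h * z
             + 0.5 * b_prime(x) * b(x) * h * (z * z - 1.0))  # added term in (6.10)
        path.append(x)
    return path


# Illustrative use with b(x) = 0.2 * x, so that b'(x) = 0.2 (an assumption).
rng = random.Random(1)
path = milstein_path(lambda x: 0.05 * x, lambda x: 0.2 * x, lambda x: 0.2,
                     100.0, 0.01, 100, rng)
```

When b is constant, `b_prime` returns zero and the recursion coincides with the Euler scheme (6.2).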


The Multidimensional Case 


Suppose, now, that $X(t) \in \mathbb{R}^d$ and $W(t) \in \mathbb{R}^m$. Write $X_i$, $W_i$, and $a_i$ for the 
ith components of X, W, and a, and write $b_{ij}$ for the ij-entry of b. Then 

$$X_i(t+h) = X_i(t) + \int_t^{t+h} a_i(X(u))\,du + \sum_{j=1}^m \int_t^{t+h} b_{ij}(X(u))\,dW_j(u),$$

and we need to approximate the integrals on the right. As in the Euler scheme, 
we approximate the drift term using 

$$\int_t^{t+h} a_i(X(u))\,du \approx a_i(X(t))h.$$

The argument leading to (6.7) yields 

$$\begin{aligned}
\int_t^{t+h} b_{ij}(X(u))\,dW_j(u) &\approx b_{ij}(X(t))[W_j(t+h) - W_j(t)] \\
&\quad + \sum_{\ell=1}^d \sum_{k=1}^m \frac{\partial b_{ij}}{\partial x_\ell}(X(t))\,b_{\ell k}(X(t)) \int_t^{t+h}[W_k(u) - W_k(t)]\,dW_j(u).
\end{aligned} \tag{6.11}$$

For k = j, we can evaluate the integral in (6.11) as in the scalar case: 

$$\int_t^{t+h}[W_j(u) - W_j(t)]\,dW_j(u) = \tfrac{1}{2}[W_j(t+h) - W_j(t)]^2 - \tfrac{1}{2}h.$$

However, there is no comparable expression for the off-diagonal terms 

$$\int_t^{t+h}[W_k(u) - W_k(t)]\,dW_j(u), \quad k \ne j.$$


These mixed integrals (or more precisely their differences) are called Lévy 
area terms; see the explanation in Protter [300, p.82], for example. Generating 
samples from their distribution is a challenging simulation problem. Methods 
for doing so are developed in Gaines and Lyons [132] and Wiktorsson [356], 
but the difficulties involved limit the applicability of the expansion (6.11) in 
models driven by multidimensional Brownian motion. Fortunately, we will see 
that for the purpose of estimating an expectation it suffices to simulate rough 
approximations to these mixed Brownian integrals. 


6.1.2 Convergence Order 


Equation (6.10) displays a refinement of the Euler scheme based on expanding 
the diffusion term to O(h) rather than just $O(\sqrt{h})$. To discuss the extent and 
the sense in which this algorithm is an improvement over the Euler scheme, we 
need to establish a figure of merit for comparing discretizations. 

Two broad categories of error of approximation are commonly used in 
measuring the quality of discretization methods: criteria based on the path- 
wise proximity of a discretized process to a continuous process, and criteria 
based on the proximity of the corresponding distributions. These are generally 
termed strong and weak criteria, respectively. 

Let $\{\hat X(0), \hat X(h), \hat X(2h), \ldots\}$ be any discrete-time approximation to a 
continuous-time process X. Fix a time T and let $n = \lfloor T/h \rfloor$. Typical strong 
error criteria are 

$$E\left[\|\hat X(nh) - X(T)\|\right], \quad E\left[\|\hat X(nh) - X(T)\|^2\right],$$

and 

$$E\left[\sup_{0 \le t \le T} \|\hat X(\lfloor t/h \rfloor h) - X(t)\|\right],$$

for some vector norm $\|\cdot\|$. Each of these expressions measures the deviation 
between the individual values of X and the approximation $\hat X$. 

In contrast, a typical weak error criterion has the form 

$$\left|E[f(\hat X(nh))] - E[f(X(T))]\right|, \tag{6.12}$$

with f ranging over functions from $\mathbb{R}^d$ to $\mathbb{R}$ typically satisfying some 
smoothness conditions. Requiring that an expression of the form (6.12) converge to 
zero as h decreases to zero imposes no constraint on the relation between the 
outcomes of $\hat X(nh)$ and X(T); indeed, the two need not even be defined on 
the same probability space. Making the error criterion (6.12) small merely 
requires that the distributions of $\hat X(nh)$ and X(T) be close. 

For applications in derivatives pricing, weak error criteria are most 
relevant. We would like to ensure that prices (which are expectations) computed 
from $\hat X$ are close to prices computed from X; we are not otherwise concerned 
about the paths of the two processes. It is nevertheless useful to be aware of 
strong error criteria to appreciate the relative merits of alternative discretiza- 
tion methods. 

Even after we fix an error criterion, it is rarely possible to ensure that 
the error using one discretization method will be smaller than the error using 
another in a specific problem. Instead, we compare methods based on their 
asymptotic performance for small h. 

Under modest conditions, even the simple Euler scheme converges (with 
respect to both strong and weak criteria) as the time step h decreases to zero. 
We therefore compare discretization schemes based on the rate at which they 
converge. Following Kloeden and Platen [211], we say that a discretization $\hat X$ 
has strong order of convergence $\beta > 0$ if 

$$E\left[\|\hat X(nh) - X(T)\|\right] \le c h^{\beta} \tag{6.13}$$

for some constant c and all sufficiently small h. The discretization scheme has 
weak order of convergence $\beta$ if 

$$\left|E[f(\hat X(nh))] - E[f(X(T))]\right| \le c h^{\beta} \tag{6.14}$$

for some constant c and all sufficiently small h, for all f in a set $C_P^{2\beta+2}$. 
The set $C_P^{2\beta+2}$ consists of functions from $\mathbb{R}^d$ to $\mathbb{R}$ whose derivatives of order 
$0, 1, \ldots, 2\beta+2$ are polynomially bounded. A function $g : \mathbb{R}^d \to \mathbb{R}$ is 
polynomially bounded if 

$$|g(x)| \le k(1 + \|x\|^q)$$

for some constants k and q and all $x \in \mathbb{R}^d$. The constant c in (6.14) may 
depend on f. 

In both (6.13) and (6.14), a larger value of $\beta$ implies faster convergence 
to zero of the discretization error. The same scheme will often have a smaller 
strong order of convergence than its weak order of convergence. For example, 
the Euler scheme typically has a strong order of 1/2, but it often achieves a 
weak order of 1. 


Convergence Order of the Euler Scheme 


In more detail, the Euler scheme has strong order 1/2 under conditions only 
slightly stronger than those in Theorem B.2.1 of Appendix B.2 for existence 
and uniqueness of a (strong) solution to the SDE (6.1). We may generalize 
(6.1) by allowing the coefficient functions a and b to depend explicitly on 
time t as well as on X(t). Because X is vector-valued, we could alternatively 
take t to be one of the components of X(t); but that formulation leads to 
unnecessarily strong conditions for convergence because it requires that the 
coefficients be as smooth in t as they are in X. In addition to the conditions 
of Theorem B.2.1, suppose that 

$$E\left[\|\hat X(0) - X(0)\|^2\right] \le K\sqrt{h} \tag{6.15}$$

and 

$$\|a(x,s) - a(x,t)\| + \|b(x,s) - b(x,t)\| \le K(1 + \|x\|)\sqrt{|t - s|}, \tag{6.16}$$

for some constant K; then the Euler scheme has strong order 1/2. (This is 
proved in Kloeden and Platen [211], pp.342-344. It is observed in Milstein [266] 
though without explicit hypotheses.) Condition (6.15) is trivially satisfied if 
X(0) is known and we set $\hat X(0)$ equal to it. 

Stronger conditions are required for the Euler scheme to have weak order 
1. For example, Theorem 14.5.2 of Kloeden and Platen [211] requires that the 
functions a and b be four times continuously differentiable with polynomially 
bounded derivatives. More generally, the Euler scheme has weak order $\beta$ if a 
and b are $2(\beta+1)$ times continuously differentiable with polynomially bounded 
derivatives; the condition (6.14) then applies only to functions f with the same 
degree of smoothness. 
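
Weak order 1 can be checked without any simulation in a simple special case. The sketch below assumes geometric Brownian motion, dX = mu X dt + sigma X dW, and the test function f(x) = x^2; both are our choices for illustration and are not taken from the text. In this case each Euler step multiplies $\hat X$ by $1 + \mu h + \sigma\sqrt{h}Z$, so $E[\hat X(nh)^2]$ is available in closed form and can be compared with $E[X(T)^2] = X(0)^2 e^{(2\mu+\sigma^2)T}$.

```python
import math


def euler_second_moment_gbm(mu, sigma, x0, T, n_steps):
    """Exact E[X_hat(nh)^2] for the Euler scheme applied to dX = mu*X dt + sigma*X dW."""
    h = T / n_steps
    # The per-step factor (1 + mu*h + sigma*sqrt(h)*Z) has second moment
    # (1 + mu*h)^2 + sigma^2 * h, and the factors are independent.
    return x0 ** 2 * ((1.0 + mu * h) ** 2 + sigma ** 2 * h) ** n_steps


def weak_error(mu, sigma, x0, T, n_steps):
    """|E[f(X_hat(nh))] - E[f(X(T))]| for f(x) = x^2."""
    exact = x0 ** 2 * math.exp((2.0 * mu + sigma ** 2) * T)
    return abs(euler_second_moment_gbm(mu, sigma, x0, T, n_steps) - exact)


# Illustrative parameter values (assumptions).
mu, sigma, x0, T = 0.05, 0.2, 1.0, 1.0
ratio = weak_error(mu, sigma, x0, T, 100) / weak_error(mu, sigma, x0, T, 200)
```

Halving h roughly halves the error, so `ratio` is close to 2, as one would expect of a scheme with weak order 1 in the sense of (6.14).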

To see how smoothness can lead to a higher weak order than strong order, 
consider the following argument. Suppose, for simplicity, that T = nh and 
that X(0) is fixed so that E[f(X(0))] is known. By writing 

$$E[f(X(T))] = E[f(X(0))] + E\left[\sum_{i=0}^{n-1} E\left[f(X((i+1)h)) - f(X(ih)) \mid X(ih)\right]\right],$$

we see that accurate estimation of E[f(X(T))] follows from accurate 
estimation of the conditional expectations $E[f(X((i+1)h)) - f(X(ih)) \mid X(ih)]$. 
Applying a Taylor approximation to f (and taking X scalar for simplicity), 
we get 

$$E[f(X((i+1)h)) - f(X(ih)) \mid X(ih)]
\approx \sum_{j=1}^{r} \frac{f^{(j)}(X(ih))}{j!}\, E\left[(X((i+1)h) - X(ih))^j \mid X(ih)\right]. \tag{6.17}$$

Thus, if f is sufficiently smooth, then to achieve a high order of weak 
convergence a discretization scheme need only approximate conditional moments of 
the increments of the process X. With sufficient smoothness in the coefficient 
functions a and b, higher conditional moments are of increasingly high order 
in h. Smoothness conditions on a, b, and f leading to a weak order of 
convergence $\beta$ for the Euler scheme follow from careful accounting of the errors in 
expanding f and approximating the conditional moments; see Kloeden and 
Platen [211], Section 14.5, and Talay [340, 341]. 

The accuracy of a discretization scheme in estimating an expression of 
the form E[f(X(T))] does not necessarily extend to the simulation of other 
quantities associated with the same process. In Section 6.4 we discuss 
difficulties arising in simulating the maximum of a diffusion, for example. Talay and 
Zheng [344] analyze discretization error in estimating quantiles of the 
distribution of a component of X(T). They provide very general conditions under 
which the bias in a quantile estimate computed from an Euler approxima- 
tion is O(h); but they also show that the implicit constant in this O(h) error 
is large — especially in the tails of the distribution — and that this makes 
accurate quantile estimation difficult. 


Convergence Order of the Refined Scheme 


Theorem 10.3.5 of Kloeden and Platen [211] and Theorem 2-2 of Talay [340] 
provide conditions under which Milstein’s refinement (6.10) and its multidi- 
mensional generalization based on (6.11) have strong order 1. The conditions 
required extend the linear growth, Lipschitz condition, and (6.16) to deriva- 
tives of the coefficient functions a and b. Thus, under these relatively modest 
additional conditions, expanding the diffusion term to O(h) instead of just 
$O(\sqrt{h})$ through the derivation in Section 6.1.1 increases the order of strong 
convergence. 
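
The gain in strong order can be observed numerically. The sketch below is only an illustration under assumptions of ours: it uses geometric Brownian motion because X(T) is then known in closed form as a function of W(T), it drives the Euler scheme (6.2), the refined scheme (6.10), and the exact solution with the same normal draws, and it reports the first strong criterion (mean absolute terminal error) for each scheme. The function name and all parameter values are hypothetical.

```python
import math
import random


def strong_errors_gbm(mu, sigma, x0, T, n_steps, n_paths, seed=0):
    """Mean absolute terminal error of Euler (6.2) and the refined scheme (6.10)
    for dX = mu*X dt + sigma*X dW, using common Brownian increments."""
    rng = random.Random(seed)
    h = T / n_steps
    sqrt_h = math.sqrt(h)
    err_euler = 0.0
    err_refined = 0.0
    for _ in range(n_paths):
        x_e = x_r = x0
        w = 0.0  # accumulates W(T), needed for the exact solution
        for _ in range(n_steps):
            z = rng.gauss(0.0, 1.0)
            dw = sqrt_h * z
            w += dw
            x_e = x_e + mu * x_e * h + sigma * x_e * dw
            # For b(x) = sigma*x we have b'(x) b(x) = sigma^2 * x.
            x_r = (x_r + mu * x_r * h + sigma * x_r * dw
                   + 0.5 * sigma * sigma * x_r * h * (z * z - 1.0))
        exact = x0 * math.exp((mu - 0.5 * sigma ** 2) * T + sigma * w)
        err_euler += abs(x_e - exact)
        err_refined += abs(x_r - exact)
    return err_euler / n_paths, err_refined / n_paths


euler_err, refined_err = strong_errors_gbm(0.05, 0.5, 1.0, 1.0, 50, 2000)
```

With these settings the refined scheme should show a noticeably smaller pathwise error than the Euler scheme; repeating the experiment with several values of h would give a rough empirical estimate of the two strong orders.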

But the weak order of convergence of the refined scheme (6.10) is also 1, 
as it is for the Euler scheme. In this respect, including additional terms — 
as in (6.10) and (6.11) — does not result in greater accuracy. This should 
not be viewed as a deficiency of Milstein’s method; rather, the Euler scheme 
is better than it “should” be, achieving order-1 weak convergence without 
expanding all terms to O(h). This is in fact just the simplest example of a 
broader pattern of results on the number of terms required to achieve strong 
or weak convergence of a given order (to which we return in Section 6.3.1). 
In order to achieve a weak order greater than that of the Euler scheme, we 
need to expand dt-integrals to order $h^2$ and stochastic integrals to order h. 
We carry this out in the next section to arrive at a method with a higher weak 
order of convergence. 

It is reassuring to know that a discretization scheme has a high order of 
convergence, but before venturing into our next derivation we should take 
note of the fact that good accuracy on smooth functions may not be directly 
relevant to our intended applications: option payoffs are typically 
nondifferentiable. Bally and Talay [34] show that the weak order of the Euler scheme 
holds for very general f and Yan [357] analyzes SDEs with irregular 
coefficients, but most of the literature requires significant smoothness assumptions. 
When applying higher-order discretization methods, it is essential to test the 
methods numerically. 


6.2 Second-Order Methods 


We now proceed to further refine the Euler scheme to arrive at a method 
with weak order 2. The derivation follows the approach used in Section 6.1.1, 
expanding the integrals of a(X(t)) and b(X(t)) to refine the Euler approxi- 
mations in (6.4) and (6.5), but now we keep more terms in the expansions. 
We begin by assuming that in the SDE (6.1) both X and W are scalar. 


6.2.1 The Scalar Case 


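To keep the notation manageable, we adopt some convenient shorthand. With 
the scalar SDE (6.1) defining X, we associate the operators 

$$\mathcal{L}^0 = a\frac{d}{dx} + \tfrac{1}{2}b^2\frac{d^2}{dx^2} \tag{6.18}$$

and 

$$\mathcal{L}^1 = b\frac{d}{dx}, \tag{6.19}$$

meaning that for any twice differentiable f, we have 

$$\mathcal{L}^0 f(x) = a(x)f'(x) + \tfrac{1}{2}b^2(x)f''(x)$$

and 

$$\mathcal{L}^1 f(x) = b(x)f'(x).$$

This allows us to write Itô's formula as 

$$df(X(t)) = \mathcal{L}^0 f(X(t))\,dt + \mathcal{L}^1 f(X(t))\,dW(t). \tag{6.20}$$

To accommodate functions f(t, X(t)) that depend explicitly on time, we would 
generalize (6.18) to 

$$\mathcal{L}^0 = \frac{\partial}{\partial t} + a\frac{\partial}{\partial x} + \tfrac{1}{2}b^2\frac{\partial^2}{\partial x^2}.$$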


As in Section 6.1.1, the key to deriving a discretization scheme lies in 
approximating the evolution of X over an interval [t, t+h]. We start from the 
representation 

$$X(t+h) = X(t) + \int_t^{t+h} a(X(u))\,du + \int_t^{t+h} b(X(u))\,dW(u), \tag{6.21}$$

and approximate each of the two integrals on the right. 

The Euler scheme approximates the first integral using the approximation 
$a(X(u)) \approx a(X(t))$ for $u \in [t, t+h]$. To derive a better approximation for 
a(X(u)), we start from the exact representation 

$$a(X(u)) = a(X(t)) + \int_t^u \mathcal{L}^0 a(X(s))\,ds + \int_t^u \mathcal{L}^1 a(X(s))\,dW(s);$$

this is Itô's formula applied to a(X(u)). Next we apply the Euler 
approximation to each of the two integrals appearing in this representation; in 
other words, we set $\mathcal{L}^0 a(X(s)) \approx \mathcal{L}^0 a(X(t))$ and $\mathcal{L}^1 a(X(s)) \approx \mathcal{L}^1 a(X(t))$ 
for $s \in [t, u]$ to get 

$$a(X(u)) \approx a(X(t)) + \mathcal{L}^0 a(X(t))\int_t^u ds + \mathcal{L}^1 a(X(t))\int_t^u dW(s).$$

Now we use this approximation in the first integral in (6.21) to get 

$$\begin{aligned}
\int_t^{t+h} a(X(u))\,du
&\approx a(X(t))h + \mathcal{L}^0 a(X(t))\int_t^{t+h}\!\!\int_t^u ds\,du + \mathcal{L}^1 a(X(t))\int_t^{t+h}\!\!\int_t^u dW(s)\,du \\
&\equiv a(X(t))h + \mathcal{L}^0 a(X(t))\,I_{(0,0)} + \mathcal{L}^1 a(X(t))\,I_{(1,0)},
\end{aligned} \tag{6.22}$$

with $I_{(0,0)}$ and $I_{(1,0)}$ denoting the indicated double integrals. This gives us 
our approximation to the first integral in (6.21). 
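We use corresponding steps for the second integral in (6.21). We 
approximate the integrand b(X(u)), $u \in [t, t+h]$, using 

$$\begin{aligned}
b(X(u)) &= b(X(t)) + \int_t^u \mathcal{L}^0 b(X(s))\,ds + \int_t^u \mathcal{L}^1 b(X(s))\,dW(s) \\
&\approx b(X(t)) + \mathcal{L}^0 b(X(t))\int_t^u ds + \mathcal{L}^1 b(X(t))\int_t^u dW(s)
\end{aligned}$$

and thus approximate the integral as 

$$\begin{aligned}
\int_t^{t+h} b(X(u))\,dW(u)
&\approx b(X(t))[W(t+h) - W(t)] + \mathcal{L}^0 b(X(t))\int_t^{t+h}\!\!\int_t^u ds\,dW(u) \\
&\quad + \mathcal{L}^1 b(X(t))\int_t^{t+h}\!\!\int_t^u dW(s)\,dW(u) \\
&\equiv b(X(t))[W(t+h) - W(t)] + \mathcal{L}^0 b(X(t))\,I_{(0,1)} + \mathcal{L}^1 b(X(t))\,I_{(1,1)}.
\end{aligned} \tag{6.23}$$

Once again, the $I_{(i,j)}$ denote the indicated double integrals. 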
If we combine (6.22) and (6.23) and make explicit the application of the 
operators $\mathcal{L}^0$ and $\mathcal{L}^1$ to a and b, we arrive at the approximation 

$$\begin{aligned}
X(t+h) \approx {} & X(t) + ah + b\Delta W + (aa' + \tfrac{1}{2}b^2a'')\,I_{(0,0)} \\
& + (ab' + \tfrac{1}{2}b^2b'')\,I_{(0,1)} + ba'\,I_{(1,0)} + bb'\,I_{(1,1)},
\end{aligned} \tag{6.24}$$

with $\Delta W = W(t+h) - W(t)$, and the functions a, b and their derivatives all 
evaluated at X(t). 


The Discretization Scheme 


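To turn the approximation in (6.24) into an implementable algorithm, we need 
to be able to simulate the double integrals $I_{(i,j)}$. Clearly, 

$$I_{(0,0)} = \int_t^{t+h}\!\!\int_t^u ds\,du = \tfrac{1}{2}h^2.$$

From (6.9) we know that 

$$I_{(1,1)} = \int_t^{t+h}[W(u) - W(t)]\,dW(u) = \tfrac{1}{2}[(\Delta W)^2 - h].$$

The term $I_{(0,1)}$ is 

$$I_{(0,1)} = \int_t^{t+h}\!\!\int_t^u ds\,dW(u) = \int_t^{t+h}(u - t)\,dW(u).$$

Applying integration by parts (which can be justified by applying Itô's formula 
to tW(t)), we get 

$$\begin{aligned}
I_{(0,1)} &= hW(t+h) - \int_t^{t+h} W(u)\,du \\
&= h[W(t+h) - W(t)] - \int_t^{t+h}[W(u) - W(t)]\,du \\
&= h\Delta W - I_{(1,0)}.
\end{aligned} \tag{6.25}$$

So, it only remains to examine 

$$I_{(1,0)} = \int_t^{t+h}[W(u) - W(t)]\,du.$$

Given W(t), the area $I_{(1,0)}$ and the increment $\Delta W = W(t+h) - W(t)$ 
are jointly normal. Each has conditional mean 0; the conditional variance of 
$\Delta W$ is h and that of $I_{(1,0)}$ is $h^3/3$ (see (3.48)). For their covariance, notice 
first that 

$$E[I_{(1,0)} \mid W(t), \Delta W] = \tfrac{1}{2}h\Delta W \tag{6.26}$$

(as illustrated in Figure 6.1), so $E[I_{(1,0)}\Delta W] = \tfrac{1}{2}h^2$. We may therefore 
simulate W(t+h) - W(t) and $I_{(1,0)}$ as 

$$\begin{pmatrix} \Delta W \\ \Delta I \end{pmatrix} \sim
N\left(0, \begin{pmatrix} h & \tfrac{1}{2}h^2 \\ \tfrac{1}{2}h^2 & \tfrac{1}{3}h^3 \end{pmatrix}\right). \tag{6.27}$$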
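
One way to draw the pair in (6.27) from two independent standard normals is sketched below. The construction is a Cholesky-type factorization of the covariance matrix worked out for this illustration; the function name `sample_dw_di` and the moment check that follows it are our assumptions rather than part of the text.

```python
import math
import random


def sample_dw_di(h, rng):
    """Draw (Delta W, Delta I) with the bivariate normal law in (6.27)."""
    z1 = rng.gauss(0.0, 1.0)
    z2 = rng.gauss(0.0, 1.0)
    dw = math.sqrt(h) * z1
    # Var(di) = (h^3 / 4) * (1 + 1/3) = h^3 / 3 and Cov(dw, di) = h^2 / 2.
    di = 0.5 * h ** 1.5 * (z1 + z2 / math.sqrt(3.0))
    return dw, di


# Empirical check of the second moments with h = 1 (illustrative only).
rng = random.Random(2024)
h = 1.0
samples = [sample_dw_di(h, rng) for _ in range(100000)]
n = len(samples)
var_dw = sum(dw * dw for dw, _ in samples) / n
var_di = sum(di * di for _, di in samples) / n
cov = sum(dw * di for dw, di in samples) / n
```

The sample moments should be close to h, h^3/3, and h^2/2, respectively, up to Monte Carlo error.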
2 3 


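This leads to the following second-order scheme: 

$$\begin{aligned}
\hat X((i+1)h) = {} & \hat X(ih) + ah + b\Delta W + (ab' + \tfrac{1}{2}b^2b'')[\Delta W h - \Delta I] \\
& + a'b\,\Delta I + \tfrac{1}{2}bb'[\Delta W^2 - h] \\
& + (aa' + \tfrac{1}{2}b^2a'')\tfrac{1}{2}h^2,
\end{aligned} \tag{6.28}$$

with the functions a, b and their derivatives all evaluated at $\hat X(ih)$. 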


Fig. 6.1. The shaded area is $\Delta I$. Given W(t) and W(t+h), the conditional 
expectation of W at any intermediate time lies on the straight line connecting these 
endpoints. The conditional expectation of $\Delta I$ is given by the area of the triangle 
with base h and height $\Delta W = W(t+h) - W(t)$. 


This method was introduced by Milstein [267] in a slightly different form. 
Talay [341] shows that Milstein’s scheme has weak order 2 under conditions on 
the coefficient functions a and b. These conditions include the requirement that 
the functions a and b be six times continuously differentiable with uniformly 
bounded derivatives. The result continues to hold if $\Delta I$ is replaced by its 
conditional expectation $\Delta W h/2$; this type of simplification becomes essential 
in the vector case, as we explain in the next section. 

Implementation of (6.28) and similar methods requires calculation of the 
derivatives of the coefficient functions of a diffusion. Methods that use dif- 
ference approximations to avoid derivative calculations without a loss in con- 
vergence order are developed in Milstein [267] and Talay [341]. These types 
of approximations are called Runge-Kutta methods in analogy with methods 
used in the numerical solution of ordinary differential equations. 


6.2.2 The Vector Case 


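We now extend the scheme in (6.28) to d-dimensional X driven by 
m-dimensional W. Much as in the scalar case, we start from the representation 

$$X_i(t+h) = X_i(t) + \int_t^{t+h} a_i(X(u))\,du + \sum_{k=1}^m \int_t^{t+h} b_{ik}(X(u))\,dW_k(u), \quad i = 1, \ldots, d,$$

and approximate each of the integrals on the right. In this setting, the relevant 
operators are 

$$\mathcal{L}^0 = \frac{\partial}{\partial t} + \sum_{i=1}^d a_i\frac{\partial}{\partial x_i}
+ \frac{1}{2}\sum_{i,j=1}^d \sum_{k=1}^m b_{ik}b_{jk}\frac{\partial^2}{\partial x_i\,\partial x_j} \tag{6.29}$$

and 

$$\mathcal{L}^k = \sum_{i=1}^d b_{ik}\frac{\partial}{\partial x_i}, \quad k = 1, \ldots, m. \tag{6.30}$$

The multidimensional Itô formula for twice continuously differentiable 
$f : \mathbb{R}^d \to \mathbb{R}$ becomes 

$$df(X(t)) = \mathcal{L}^0 f(X(t))\,dt + \sum_{k=1}^m \mathcal{L}^k f(X(t))\,dW_k(t). \tag{6.31}$$

Applying (6.31) to $a_i$, we get 

$$a_i(X(u)) = a_i(X(t)) + \int_t^u \mathcal{L}^0 a_i(X(s))\,ds + \sum_{k=1}^m \int_t^u \mathcal{L}^k a_i(X(s))\,dW_k(s).$$

The same steps leading to the approximation (6.22) in the scalar case now 
yield the approximation 

$$\int_t^{t+h} a_i(X(u))\,du \approx a_i(X(t))h + \mathcal{L}^0 a_i(X(t))\,I_{(0,0)} + \sum_{k=1}^m \mathcal{L}^k a_i(X(t))\,I_{(k,0)},$$

with 

$$I_{(k,0)} = \int_t^{t+h}\!\!\int_t^u dW_k(s)\,du, \quad k = 1, \ldots, m.$$

Similarly, the representation 

$$b_{ik}(X(u)) = b_{ik}(X(t)) + \int_t^u \mathcal{L}^0 b_{ik}(X(s))\,ds + \sum_{j=1}^m \int_t^u \mathcal{L}^j b_{ik}(X(s))\,dW_j(s)$$

leads to the approximation 

$$\int_t^{t+h} b_{ik}(X(u))\,dW_k(u) \approx b_{ik}(X(t))\Delta W_k + \mathcal{L}^0 b_{ik}(X(t))\,I_{(0,k)} + \sum_{j=1}^m \mathcal{L}^j b_{ik}(X(t))\,I_{(j,k)},$$

with 

$$I_{(0,k)} = \int_t^{t+h}\!\!\int_t^u ds\,dW_k(u), \quad k = 1, \ldots, m,$$

and 

$$I_{(j,k)} = \int_t^{t+h}\!\!\int_t^u dW_j(s)\,dW_k(u), \quad j, k = 1, \ldots, m.$$

The notational convention for these integrals should be evident: in $I_{(j,k)}$ 
we integrate first over $W_j$ and then over $W_k$. This interpretation extends to 
j = 0 if we set $W_0(t) \equiv t$. 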

By combining the expansions above for the integrals of $a_i$ and $b_{ik}$, we 
arrive at the discretization 

$$\begin{aligned}
\hat X_i(t+h) = {} & \hat X_i(t) + a_i(\hat X(t))h + \sum_{k=1}^m b_{ik}(\hat X(t))\Delta W_k \\
& + \tfrac{1}{2}\mathcal{L}^0 a_i(\hat X(t))h^2 + \sum_{k=1}^m \mathcal{L}^k a_i(\hat X(t))\,I_{(k,0)} \\
& + \sum_{k=1}^m \left(\mathcal{L}^0 b_{ik}(\hat X(t))\,I_{(0,k)} + \sum_{j=1}^m \mathcal{L}^j b_{ik}(\hat X(t))\,I_{(j,k)}\right),
\end{aligned} \tag{6.32}$$

for each i = 1, ..., d. Here we have substituted $h^2/2$ for $I_{(0,0)}$ and abbreviated 
$W_k(t+h) - W_k(t)$ as $\Delta W_k$. The application of each of the operators $\mathcal{L}^j$ to 
any of the coefficient functions $a_i$, $b_{ik}$ produces a polynomial in the coefficient 
functions and their derivatives; these expressions can be made explicit using 
(6.29) and (6.30). Using the identity 

$$I_{(0,j)} + I_{(j,0)} = \Delta W_j h,$$

which follows from (6.25), we could rewrite all terms involving $I_{(0,j)}$ as 
multiples of $(\Delta W_j h - I_{(j,0)})$ instead. Thus, to implement (6.32) we need to sample, 
for each j = 1, ..., m, the Brownian increments $\Delta W_j$ together with the 
integrals $I_{(j,0)}$ and $I_{(j,k)}$, k = 1, ..., m. We address this issue next. 


Commutativity Condition 


As noted in Section 6.1.1, the mixed Brownian integrals $I_{(j,k)}$ with $j \ne k$ are 
difficult to simulate, so (6.32) does not provide a practical algorithm without 
further simplification. Simulation of the mixed integrals is obviated in models 
satisfying the commutativity condition 

$$\mathcal{L}^k b_{ij} = \mathcal{L}^j b_{ik} \tag{6.33}$$

for all i = 1, ..., d. This is a rather artificial condition and is not often satisfied 
in practice, but it provides an interesting simplification of the second-order 
approximation. 

When (6.33) holds, we may group terms in (6.32) involving mixed integrals 
$I_{(j,k)}$, $j, k \ge 1$, and write them as 

$$\sum_{j=1}^m \sum_{k=1}^m \mathcal{L}^j b_{ik}\,I_{(j,k)}
= \sum_{j=1}^m \mathcal{L}^j b_{ij}\,I_{(j,j)} + \sum_{j=1}^m \sum_{k=j+1}^m \mathcal{L}^j b_{ik}\left(I_{(j,k)} + I_{(k,j)}\right).$$

As in the scalar case (6.9), the diagonal term $I_{(j,j)}$ evaluates to $(\Delta W_j^2 - h)/2$ 
and is thus easy to simulate. The utility of the commutativity condition lies 
in the observation that even though each $I_{(j,k)}$, $j \ne k$, is difficult to simulate, 
the required sums simplify to 

$$I_{(j,k)} + I_{(k,j)} = \Delta W_j \Delta W_k. \tag{6.34}$$

This follows from applying Itô's formula to $W_k(t)W_j(t)$ to get 

$$W_k(t+h)W_j(t+h) - W_k(t)W_j(t) = \int_t^{t+h} W_k(u)\,dW_j(u) + \int_t^{t+h} W_j(u)\,dW_k(u)$$

and then subtracting $W_k(t)\Delta W_j + W_j(t)\Delta W_k$ from both sides. 

When the commutativity condition is satisfied, the discretization scheme 
(6.32) thus simplifies to 

$$\begin{aligned}
\hat X_i(t+h) = {} & \hat X_i(t) + a_i(\hat X(t))h + \sum_{k=1}^m b_{ik}(\hat X(t))\Delta W_k + \tfrac{1}{2}\mathcal{L}^0 a_i(\hat X(t))h^2 \\
& + \sum_{k=1}^m \left(\left[\mathcal{L}^k a_i(\hat X(t)) - \mathcal{L}^0 b_{ik}(\hat X(t))\right]\Delta I_k + \mathcal{L}^0 b_{ik}(\hat X(t))\Delta W_k h\right) \\
& + \sum_{j=1}^m \left(\mathcal{L}^j b_{ij}(\hat X(t))\tfrac{1}{2}(\Delta W_j^2 - h) + \sum_{k=j+1}^m \mathcal{L}^j b_{ik}(\hat X(t))\Delta W_j \Delta W_k\right),
\end{aligned} \tag{6.35}$$

with $\Delta I_k = I_{(k,0)}$. Because the components of W are independent of each 
other, the pairs $(\Delta W_k, \Delta I_k)$, k = 1, ..., m, are independent of each other. 
Each such pair has the bivariate normal distribution identified in (6.27) and 
is thus easy to simulate. 


Example 6.2.1 LIBOR Market Model. As an illustration of the 
commutativity condition (6.33), we consider the LIBOR market model of Section 3.7. 
Thus, take $X_i$ to be the ith forward rate $L_i$ in the spot measure dynamics in 
(3.112). This specifies that the evolution of $L_i$ is governed by an SDE of the 
form 

$$dL_i(t) = L_i(t)\mu_i(L(t), t)\,dt + L_i(t)\sigma_i(t)^\top dW(t),$$

with, for example, $\sigma_i$ a deterministic function of time. In the notation of this 
section, $b_{ij} = L_i\sigma_{ij}$. The commutativity condition (6.33) requires 

$$\sum_{r=1}^d \frac{\partial b_{ij}}{\partial x_r}\,b_{rk} = \sum_{r=1}^d \frac{\partial b_{ik}}{\partial x_r}\,b_{rj},$$

and this is satisfied because both sides evaluate to $\sigma_{ij}\sigma_{ik}L_i$. More generally, 
the commutativity condition is satisfied whenever $b_{ij}(X(t))$ factors as the 
product of a function of $X_i(t)$ and a deterministic function of time. 

If we set $X_i(t) = \log L_i(t)$ then X solves an SDE of the form 

$$dX_i(t) = \left(\mu_i(X(t), t) - \tfrac{1}{2}\|\sigma_i(t)\|^2\right)dt + \sigma_i(t)^\top dW(t).$$

In this case, $b_{ij} = \sigma_{ij}$ does not depend on X at all so the commutativity 
condition is automatically satisfied. □ 


A Simplified Scheme 


Even when the commutativity condition fails, the discretization method (6.32) 
can be simplified for practical implementation. Talay [340] and Kloeden and 
Platen [211, p.465] show that the scheme continues to have weak order 2 if 
each AJ; is replaced with SAW; h. (Related simplifications are used in Milstein 
[267] and Talay [341].) Observe from (6.26) that this amounts to replacing AJ; 
with its conditional expectation given AW;. As a consequence, sAW;h has 
the same covariance with AW; as AJ; does: 


E[AW, - sAW;h] = ShE[AW?] = 5h’. 


It also has the same mean as AJ; but variance h?/4 rather than h°/3, an 
error of O(h?). This turns out to be close enough to preserve the order of 
convergence. In the scalar case (6.28), the simplified scheme is 


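$$\begin{aligned}
\hat X(n+1) = \hat X(n) &+ ah + b\,\Delta W \\
&+ \tfrac12\left(a'b + ab' + \tfrac12 b^2 b''\right)\Delta W h + \tfrac12 bb'[\Delta W^2 - h] \\
&+ \left(aa' + \tfrac12 b^2 a''\right)\tfrac12 h^2,
\end{aligned} \tag{6.36}$$

with $a$, $b$, and their derivatives evaluated at $\hat X(n)$.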
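
As an illustration only, one step of the simplified scheme (6.36) can be sketched in Python as follows. The function and argument names are our own (they do not come from any library), and the caller is assumed to supply the coefficient functions together with their first and second derivatives.

```python
def simplified_second_order_step(x, h, dw, a, b, da, d2a, db, d2b):
    """One step of the simplified weak second-order scheme for dX = a(X) dt + b(X) dW.

    a, b are the coefficient functions; da, d2a, db, d2b are their first and
    second derivatives (hypothetical callables supplied by the caller).
    dw is the Brownian increment over a step of length h.
    """
    a0, b0 = a(x), b(x)
    a1, a2 = da(x), d2a(x)
    b1, b2 = db(x), d2b(x)
    return (x + a0 * h + b0 * dw
            + 0.5 * (a1 * b0 + a0 * b1 + 0.5 * b0**2 * b2) * dw * h
            + 0.5 * b0 * b1 * (dw**2 - h)
            + 0.5 * (a0 * a1 + 0.5 * b0**2 * a2) * h**2)


def gbm_step(x, h, dw, mu=0.05, sigma=0.2):
    """Specialization to geometric Brownian motion: a(x) = mu x, b(x) = sigma x."""
    return simplified_second_order_step(
        x, h, dw,
        a=lambda y: mu * y, b=lambda y: sigma * y,
        da=lambda y: mu, d2a=lambda y: 0.0,
        db=lambda y: sigma, d2b=lambda y: 0.0)
```

For geometric Brownian motion the second derivatives vanish, so the step collapses to a polynomial in $h$ and $\Delta W$ multiplying the current state, which gives a convenient hand check of an implementation.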

In the vector case, the simplified scheme replaces the double integrals in
(6.32) with simpler random variables. As in the scalar case, $I_{(0,j)}$ and $I_{(j,0)}$
are approximated by $\Delta W_j h/2$. Each $I_{(j,j)}$, $j \ne 0$, evaluates to $(\Delta W_j^2 - h)/2$.
For $j, k$ different from zero and from each other, $I_{(j,k)}$ is approximated by
(Talay [341], Kloeden and Platen [211], Section 14.2)

$$\tfrac12\left(\Delta W_j\Delta W_k - V_{jk}\right), \tag{6.37}$$

with $V_{kj} = -V_{jk}$, and the $V_{jk}$, $j < k$, independent random variables taking
values $h$ and $-h$ each with probability 1/2. Let $V_{jj} = h$. The resulting
approximation is, for each coordinate $i = 1,\ldots,d$,

$$\begin{aligned}
\hat X_i(n+1) = \hat X_i(n) &+ a_i h + \sum_{k=1}^{m} b_{ik}\,\Delta W_k + \tfrac12\mathcal{L}^0 a_i\, h^2 + \tfrac12\sum_{k=1}^{m}\left(\mathcal{L}^k a_i + \mathcal{L}^0 b_{ik}\right)\Delta W_k h \\
&+ \tfrac12\sum_{k=1}^{m}\sum_{j=1}^{m}\mathcal{L}^j b_{ik}\left(\Delta W_j\Delta W_k - V_{jk}\right),
\end{aligned} \tag{6.38}$$

with all $a_i$, $b_{ij}$, and their derivatives evaluated at $\hat X(n)$.

In these simplified schemes, the $\Delta W$ can be replaced with other random
variables $\widehat{\Delta W}$ with moments up to order 5 that are within $O(h^3)$ of those
of $\Delta W$. (See the discussion following (6.17) and, for precise results, Kloeden
and Platen [211, p.465] and Talay [341, 342].) This includes the three-point
distributions

$$P(\widehat{\Delta W} = \pm\sqrt{3h}) = \tfrac16, \qquad P(\widehat{\Delta W} = 0) = \tfrac23.$$

These are faster to generate, but using normally distributed $\Delta W$ will generally
result in smaller bias. The justification for using (6.37) also lies in the fact
that these simpler random variables have moments up to order five that are
within $O(h^3)$ of those of the $I_{(j,k)}$; see Section 5.12 of Kloeden and Platen [211,
p.465], Section 1.6 of Talay [341], or Section 5 of Talay [342]. Talay [341, 342]
calls these "Monte Carlo equivalent" families of random variables.
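
The moment-matching property of the three-point distribution is easy to check directly. The following Python sketch (names are ours, chosen for illustration) draws from the distribution and compares its exact moments with those of $N(0, h)$; the first five moments agree and the sixth differs by $6h^3$.

```python
import math
import random


def three_point_increment(h, rng=random):
    """Draw from P(+sqrt(3h)) = P(-sqrt(3h)) = 1/6, P(0) = 2/3."""
    u = rng.random()
    if u < 1.0 / 6.0:
        return math.sqrt(3.0 * h)
    if u < 1.0 / 3.0:
        return -math.sqrt(3.0 * h)
    return 0.0


def three_point_moment(h, k):
    """Exact k-th moment (k >= 1) of the three-point distribution."""
    s = math.sqrt(3.0 * h)
    return (s**k + (-s)**k) / 6.0


def normal_moment(h, k):
    """Exact k-th moment (k >= 1) of N(0, h): zero for odd k, h^(k/2) (k-1)!! for even k."""
    if k % 2 == 1:
        return 0.0
    double_factorial = math.prod(range(1, k, 2))
    return h ** (k // 2) * double_factorial
```

With a step such as $h = 0.01$, the sixth-moment gap $6h^3$ is of order $10^{-6}$, which is the sense in which the replacement is accurate to $O(h^3)$.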


Example 6.2.2 Stochastic volatility model. In Section 3.4, we noted that the
square-root diffusion is sometimes used to model stochastic volatility. Heston's
[179] model is

$$\begin{aligned}
dS(t) &= rS(t)\,dt + \sqrt{V(t)}\,S(t)\,dW_1(t) \\
dV(t) &= \kappa(\theta - V(t))\,dt + \sqrt{V(t)}\left(\sigma_1\,dW_1(t) + \sigma_2\,dW_2(t)\right),
\end{aligned}$$

with $S$ interpreted as, e.g., a stock price. The Brownian motions $W_1$ and $W_2$
are independent of each other. Heston [179] derives a formula for option prices
in this setting using Fourier transform inversion. This provides a benchmark
against which to compare simulation methods.

The simplified second-order scheme (6.38) for this model is as follows: 


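$$\begin{aligned}
\hat S(i+1) ={}& \hat S(i)\left(1 + rh + \sqrt{\hat V(i)}\,\Delta W_1\right) + \tfrac12 r^2\hat S(i)h^2 \\
&+ \left(\left[r + \frac{\sigma_1 - \kappa}{4}\right]\hat S(i)\sqrt{\hat V(i)} + \left[\frac{\kappa\theta}{4} - \frac{\sigma^2}{16}\right]\frac{\hat S(i)}{\sqrt{\hat V(i)}}\right)\Delta W_1 h \\
&+ \tfrac12\hat S(i)\left(\hat V(i) + \frac{\sigma_1}{2}\right)(\Delta W_1^2 - h) + \tfrac14\sigma_2\hat S(i)\left(\Delta W_2\Delta W_1 + \xi\right)
\end{aligned}$$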


and 
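$$\begin{aligned}
\hat V(i+1) ={}& \kappa\theta h + (1 - \kappa h)\hat V(i) + \sqrt{\hat V(i)}\left(\sigma_1\Delta W_1 + \sigma_2\Delta W_2\right) - \tfrac12\kappa^2(\theta - \hat V(i))h^2 \\
&+ \left(\left[\frac{\kappa\theta}{4} - \frac{\sigma^2}{16}\right]\frac{1}{\sqrt{\hat V(i)}} - \frac{3\kappa}{4}\sqrt{\hat V(i)}\right)\left(\sigma_1\Delta W_1 + \sigma_2\Delta W_2\right)h \\
&+ \tfrac14\sigma_1^2(\Delta W_1^2 - h) + \tfrac14\sigma_2^2(\Delta W_2^2 - h) + \tfrac12\sigma_1\sigma_2\Delta W_1\Delta W_2
\end{aligned}$$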


with $\sigma^2 = \sigma_1^2 + \sigma_2^2$ and $\xi$ taking the values $h$ and $-h$ with probability 1/2
independent of the Brownian increments. To avoid taking the square root of
a negative number or dividing by zero, we replace $\hat V(i)$ by its absolute value
before advancing these recursions.
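
A single step of these recursions might be coded as in the Python sketch below. The coefficients are obtained by applying (6.38) to the model dynamics, so the sketch should be read as a derivation to be checked rather than a reference implementation; the function name and argument order are our own.

```python
import math


def heston_second_order_step(s, v, h, dw1, dw2, xi, r, kappa, theta, sigma1, sigma2):
    """One step of the simplified second-order scheme for (S, V).

    dw1, dw2 are independent Brownian increments over a step of length h;
    xi takes the values +h and -h with probability 1/2 each.
    """
    v = abs(v)                      # guard against the square root of a negative number
    rv = math.sqrt(v)               # note: an exact zero would still divide by zero below
    sig2 = sigma1**2 + sigma2**2
    c = kappa * theta / 4.0 - sig2 / 16.0
    s_next = (s * (1.0 + r * h + rv * dw1) + 0.5 * r**2 * s * h**2
              + ((r + (sigma1 - kappa) / 4.0) * s * rv + c * s / rv) * dw1 * h
              + 0.5 * s * (v + sigma1 / 2.0) * (dw1**2 - h)
              + 0.25 * sigma2 * s * (dw2 * dw1 + xi))
    dz = sigma1 * dw1 + sigma2 * dw2
    v_next = (kappa * theta * h + (1.0 - kappa * h) * v + rv * dz
              - 0.5 * kappa**2 * (theta - v) * h**2
              + (c / rv - 0.75 * kappa * rv) * dz * h
              + 0.25 * sigma1**2 * (dw1**2 - h)
              + 0.25 * sigma2**2 * (dw2**2 - h)
              + 0.5 * sigma1 * sigma2 * dw1 * dw2)
    return s_next, v_next
```

One sanity check: with $\sigma_1 = \sigma_2 = 0$ and $V$ started at $\theta$, the variance stays at $\theta$ and the step for $S$ reduces to the scalar scheme (6.36) applied to geometric Brownian motion with volatility $\sqrt\theta$.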

Figure 6.2 displays numerical results using this scheme and a simple Euler
approximation. We use parameters $S(0) = 100$, $V(0) = 0.04$, $r = 5\%$, $\kappa = 1.2$,
$\theta = 0.04$, $\sigma = 0.30$, and $\sigma_1 = \rho\sigma$ with $\rho = -0.5$. Using Heston's [179] formula,
the expectation $E[e^{-rT}(S(T) - K)^+]$ with $T = 1$ and $K = 100$ evaluates to
10.3009. We compare our simulation results against this value to estimate
bias. We use simulation time step $h = T/n$, with $n$ = 3, 6, 12, 25, and 100
and run 2–4 million replications at each $n$ for each method.

Figure 6.2 plots the estimated log absolute bias against log n. The bias
in the Euler scheme for this example falls below 0.01 at n = 25 steps per 
year, whereas the second-order method has a bias this small even at n = 3 
steps per year. As n increases, the results for the Euler scheme look roughly 
consistent with first-order convergence; the second-order method produces 
smaller estimated biases but its convergence is much more erratic. In fact our 
use of (6.38) for this problem lacks theoretical support because the square-root 
functions in the model dynamics and the kink in the call option payoff violate 
the smoothness conditions required to ensure second-order convergence. The 
more regular convergence displayed by the Euler scheme in this example lends 
itself to the extrapolation method in Section 6.2.4. 


6.2.3 Incorporating Path-Dependence 


The error criterion in (6.14) applies to expectations of the form E[f(X(T))] 
with T fixed. Accurate estimation of E[f(X(T))] requires accurate approxi- 
mation only of the distribution of X(T). In many pricing problems, however, 
we are interested not only in the terminal state of an underlying process, but 
also in the path by which the terminal state is reached. The error criterion 
(6.14) does not appear to offer any guarantees on the approximation error in 
simulating functions of the path, raising the question of whether properties of 
the Euler and higher-order schemes extend to such functions. 

One way to extend the framework of the previous sections to path- 
dependent quantities is to transform dependence on the past into dependence 
on supplementary state variables. This section illustrates this idea. 
Fig. 6.2. Estimated bias versus number of steps in discretization of a stochastic volatility model.


Suppose we want to compute a bond price 


$$E\left[\exp\left(-\int_0^T r(t)\,dt\right)\right] \tag{6.39}$$

with the (risk-neutral) dynamics of the short rate $r$ described by the scalar
SDE

$$dr(t) = \mu(r(t))\,dt + \sigma(r(t))\,dW(t).$$


If we simulate some discretization $\hat r(i) = \hat r(ih)$, $i = 0, 1, \ldots, n-1$, with time
step $h = T/n$, the simplest estimate of the bond price would be

$$\exp\left(-h\sum_{i=0}^{n-1}\hat r(i)\right). \tag{6.40}$$

An alternative introduces the variable

$$D(t) = \exp\left(-\int_0^t r(u)\,du\right),$$

develops a discretization scheme for the bivariate diffusion

$$d\begin{pmatrix} r(t) \\ D(t)\end{pmatrix} = \begin{pmatrix}\mu(r(t)) \\ -r(t)D(t)\end{pmatrix}dt + \begin{pmatrix}\sigma(r(t)) \\ 0\end{pmatrix}dW(t), \tag{6.41}$$

and uses $\hat D(nh)$ as an estimate of the bond price (6.39). In (6.41), the driving
Brownian motion is still one-dimensional, so we have not really made the
problem any more difficult by enlarging the state vector. The difficulties addressed
in Section 6.2.2 arise when $W$ is vector-valued.
The Euler scheme for the bivariate diffusion is

$$\begin{aligned}
\hat r(i+1) &= \hat r(i) + \mu(\hat r(i))h + \sigma(\hat r(i))\Delta W \\
\hat D(i+1) &= \hat D(i) - \hat r(i)\hat D(i)h.
\end{aligned}$$

Because of the smoothness of the coefficients of the SDE for $D(t)$, this
discretization inherits whatever order of convergence the coefficients $\mu$ and $\sigma$
ensure for $\hat r$. Beyond this guarantee, the bivariate formulation offers no clear
advantage for the Euler scheme compared to simply using (6.40). Indeed, if
we apply the Euler scheme to $\log D(t)$ rather than $D(t)$, we recover (6.40)
exactly.
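But we do find a difference when we apply a second-order discretization.
The simplified second-order scheme for a generic bivariate diffusion $X$ driven
by a scalar Brownian motion has the form

$$\begin{aligned}
\begin{pmatrix}\hat X_1(i+1) \\ \hat X_2(i+1)\end{pmatrix} = \text{Euler terms} &+ \frac12\begin{pmatrix}\mathcal{L}^1 b_1(\hat X(i)) \\ \mathcal{L}^1 b_2(\hat X(i))\end{pmatrix}(\Delta W^2 - h) \\
&+ \frac12\begin{pmatrix}\mathcal{L}^1 a_1(\hat X(i)) + \mathcal{L}^0 b_1(\hat X(i)) \\ \mathcal{L}^1 a_2(\hat X(i)) + \mathcal{L}^0 b_2(\hat X(i))\end{pmatrix}\Delta W h + \frac12\begin{pmatrix}\mathcal{L}^0 a_1(\hat X(i)) \\ \mathcal{L}^0 a_2(\hat X(i))\end{pmatrix}h^2
\end{aligned}$$

with

$$\mathcal{L}^0 = a_1\frac{\partial}{\partial x_1} + a_2\frac{\partial}{\partial x_2} + \frac12\left(b_1^2\frac{\partial^2}{\partial x_1^2} + 2b_1b_2\frac{\partial^2}{\partial x_1\partial x_2} + b_2^2\frac{\partial^2}{\partial x_2^2}\right)$$

and

$$\mathcal{L}^1 = b_1\frac{\partial}{\partial x_1} + b_2\frac{\partial}{\partial x_2}.$$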

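When specialized to the bond-pricing setting, this discretizes $r(t)$ as

$$\begin{aligned}
\hat r(i+1) = \hat r(i) &+ \mu h + \sigma\Delta W + \tfrac12\sigma\sigma'[\Delta W^2 - h] \\
&+ \tfrac12\left(\sigma\mu' + \mu\sigma' + \tfrac12\sigma^2\sigma''\right)\Delta W h + \tfrac12\left(\mu\mu' + \tfrac12\sigma^2\mu''\right)h^2
\end{aligned}$$

with $\mu$, $\sigma$ and their derivatives on the right evaluated at $\hat r(i)$. This is exactly
the same as the scheme for $r(t)$ alone. But in discretizing $D(t)$, we get

$$\mathcal{L}^1(-r(t)D(t)) = -\sigma(r(t))D(t)$$

and

$$\mathcal{L}^0(-r(t)D(t)) = -\mu(r(t))D(t) + r(t)^2D(t).$$

Hence, the scheme becomes

$$\hat D(i+1) = \hat D(i)\left(1 - \hat r(i)h + \tfrac12\left[\hat r(i)^2 - \mu(\hat r(i))\right]h^2 - \tfrac12\sigma(\hat r(i))\Delta W h\right),$$

which involves terms not reflected in (6.40).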
Again because of the smoothness of the coefficients for D(t), this method
has weak order 2 if the scheme for r(t) itself achieves this order. Thus, the 
weak error criterion extends to the bond price (6.39), though it would not 
necessarily extend if we applied the crude discretization in (6.40). 
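
To make the comparison concrete, the sketch below prices a bond under a Vasicek short rate, $\mu(r) = \kappa(\theta - r)$ and $\sigma(r) = \sigma$ constant. This model is our choice for illustration only (it is convenient because the bond price is known in closed form); the function names and parameter values are likewise assumptions. The code computes the crude estimate (6.40) from an Euler path and the second-order estimate from the pair $(\hat r, \hat D)$, using the same Brownian increments for both.

```python
import math
import random


def vasicek_bond_estimates(r0, kappa, theta, sigma, T, n, paths, seed=0):
    """Return (crude estimate (6.40), second-order estimate of D(T))."""
    rng = random.Random(seed)
    h = T / n
    sqrt_h = math.sqrt(h)
    crude_sum = 0.0
    second_sum = 0.0
    for _ in range(paths):
        r_euler, integral = r0, 0.0   # Euler path and its left Riemann sum
        r2, d = r0, 1.0               # second-order pair (r, D)
        for _ in range(n):
            dw = sqrt_h * rng.gauss(0.0, 1.0)
            integral += r_euler * h
            r_euler += kappa * (theta - r_euler) * h + sigma * dw
            mu = kappa * (theta - r2)
            # D update uses r and mu evaluated at the start of the step
            d *= 1.0 - r2 * h + 0.5 * (r2 * r2 - mu) * h * h - 0.5 * sigma * dw * h
            # second-order step for r with mu' = -kappa, sigma' = 0, mu'' = 0
            r2 += (mu * h + sigma * dw
                   - 0.5 * sigma * kappa * dw * h - 0.5 * mu * kappa * h * h)
        crude_sum += math.exp(-integral)
        second_sum += d
    return crude_sum / paths, second_sum / paths


def vasicek_bond_exact(r0, kappa, theta, sigma, T):
    """Closed-form Vasicek zero-coupon bond price exp(A - B r0)."""
    B = (1.0 - math.exp(-kappa * T)) / kappa
    A = (theta - sigma**2 / (2.0 * kappa**2)) * (B - T) - sigma**2 * B**2 / (4.0 * kappa)
    return math.exp(A - B * r0)
```

With a coarse grid (say $n = 8$ steps over $T = 1$) and a short rate that starts well below its long-run level, the left Riemann sum in (6.40) visibly understates the integral, while the second-order estimate stays much closer to the closed-form value.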

The same idea clearly applies in computing a function of, e.g.,

$$\left(X(T), \int_0^T X(t)\,dt\right),$$

as might be required in pricing an Asian option. In contrast, incorporating
path-dependence through the maximum or minimum of a process (to price
a barrier or lookback option, for example) is more delicate. We can define a
supplementary variable of the form

$$M(t) = \max_{0\le s\le t} X(s)$$

to remove dependence on the past of X, but the method applied above with 
the discount factor D(t) does not extend to the bivariate process (X (t), M (t)). 
The difficulty lies in the fact that the running maximum M (t) does not satisfy 
an SDE with smooth coefficients. For example, M remains constant except 
when X(t) = M(t). Asmussen, Glynn, and Pitman [24] show that even when 
X is ordinary Brownian motion (so that the Euler scheme for X is exact), 
the Euler scheme for M has weak order 1/2 rather than the weak order 1 
associated with smooth coefficient functions. We return to the problem of 
discretizing the running maximum in Section 6.4. 


6.2.4 Extrapolation 


An alternative approach to achieving second-order accuracy applies Richard- 
son extrapolation (also called Romberg extrapolation) to two estimates ob- 
tained from a first-order scheme at two different levels of discretization. This is 
easier to implement than a second-order scheme and usually achieves roughly 
the same accuracy — sometimes better, sometimes worse. The same idea can 
(under appropriate conditions) boost the order of convergence of a second-order
or even higher-order scheme, but these extensions are not as effective in
practice.

To emphasize the magnitude of the time increment, we write $\hat X^h$ for a
discretized process with step size $h$. We write $\hat X^h(T)$ for the state of the
discretized process at time $T$; more explicitly, this is $\hat X^h(\lfloor T/h\rfloor h)$.

As discussed in Section 6.1.2, the Euler scheme often has weak order 1, in
which case

$$\left|E[f(\hat X^h(T))] - E[f(X(T))]\right| \le Ch \tag{6.42}$$

for some constant $C$, for all sufficiently small $h$, for suitable $f$. Talay and
Tubaro [343], Bally and Talay [34], and Protter and Talay [301] prove that
the bound in (6.42) can often be strengthened to an equality of the form

$$E[f(\hat X^h(T))] = E[f(X(T))] + ch + o(h),$$

for some constant $c$ depending on $f$. In this case, the discretization with time
step $2h$ satisfies

$$E[f(\hat X^{2h}(T))] = E[f(X(T))] + 2ch + o(h),$$

with the same constant $c$.

By combining the approximations with time steps $h$ and $2h$, we can eliminate
the leading error term. More explicitly, from the previous two equations
we get

$$2E[f(\hat X^h(T))] - E[f(\hat X^{2h}(T))] = E[f(X(T))] + o(h). \tag{6.43}$$

This suggests the following algorithm: simulate with time step $h$ to estimate
$E[f(\hat X^h(T))]$; simulate with time step $2h$ to estimate $E[f(\hat X^{2h}(T))]$; double
the first estimate and subtract the second to estimate $E[f(X(T))]$. The bias
in this combined estimate is of smaller order than the bias in either of its two
components.

Talay and Tubaro [343] and Protter and Talay [301] give conditions under 
which the $o(h)$ term in (6.43) is actually $O(h^2)$ (and indeed under which the
error can be expanded in arbitrarily high powers of h). This means that apply- 
ing extrapolation to the Euler scheme produces an estimate with weak order 
2. Because the Euler scheme is easy to implement, this offers an attractive 
alternative to the second-order schemes derived in Sections 6.2.1 and 6.2.2. 

The variance of the extrapolated estimate is typically reduced if we use
consistent Brownian increments in simulating paths of $\hat X^h$ and $\hat X^{2h}$. Each
Brownian increment driving $\hat X^{2h}$ is the sum of two of the increments driving
$\hat X^h$. If we use $\sqrt{h}Z_1, \sqrt{h}Z_2, \ldots$ as Brownian increments for $\hat X^h$ we should use
$\sqrt{h}(Z_1 + Z_2), \sqrt{h}(Z_3 + Z_4), \ldots$ as Brownian increments for $\hat X^{2h}$. Whether or
not we use this construction (as opposed to, e.g., simulating the two
independently) has no bearing on the validity of (6.43) because (6.43) refers only
to expectations and is unaffected by any dependence between $\hat X^h$ and $\hat X^{2h}$.
Observe, however, that

$$\begin{aligned}
\mathrm{Var}\left[2f(\hat X^h(T)) - f(\hat X^{2h}(T))\right] ={}& 4\mathrm{Var}\left[f(\hat X^h(T))\right] + \mathrm{Var}\left[f(\hat X^{2h}(T))\right] \\
&- 4\mathrm{Cov}\left[f(\hat X^h(T)), f(\hat X^{2h}(T))\right].
\end{aligned}$$

Making $f(\hat X^h(T))$ and $f(\hat X^{2h}(T))$ positively correlated will therefore reduce
variance, even though it has no effect on discretization bias. Using consistent 
Brownian increments will not always produce positive correlation, but it often 
will. Positive correlation can be guaranteed through monotonicity conditions, 
for example. This issue is closely related to the effectiveness of antithetic 
sampling; see Section 4.2, especially the discussion surrounding (4.29). 
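
The extrapolation recipe with consistent Brownian increments can be sketched as follows. For concreteness the sketch applies the Euler scheme to geometric Brownian motion, $dX = \mu X\,dt + \sigma X\,dW$; that model, the function names, and the parameter values are assumptions made for illustration.

```python
import math
import random


def euler_pair_gbm(x0, mu, sigma, T, n_fine, rng):
    """Return (X^h(T), X^{2h}(T)) driven by the same Brownian path.

    n_fine is the number of fine steps and is assumed to be even.
    """
    h = T / n_fine
    sqrt_h = math.sqrt(h)
    x_fine, x_coarse = x0, x0
    for _ in range(n_fine // 2):
        dw1 = sqrt_h * rng.gauss(0.0, 1.0)
        dw2 = sqrt_h * rng.gauss(0.0, 1.0)
        # two fine Euler steps of length h
        x_fine += mu * x_fine * h + sigma * x_fine * dw1
        x_fine += mu * x_fine * h + sigma * x_fine * dw2
        # one coarse Euler step of length 2h using the summed increment
        x_coarse += mu * x_coarse * 2.0 * h + sigma * x_coarse * (dw1 + dw2)
    return x_fine, x_coarse


def extrapolated_estimate(f, x0, mu, sigma, T, n_fine, paths, seed=0):
    """Return (plain Euler estimate with step h, extrapolated estimate 2E_h - E_2h)."""
    rng = random.Random(seed)
    sum_fine = sum_extrap = 0.0
    for _ in range(paths):
        xf, xc = euler_pair_gbm(x0, mu, sigma, T, n_fine, rng)
        sum_fine += f(xf)
        sum_extrap += 2.0 * f(xf) - f(xc)
    return sum_fine / paths, sum_extrap / paths
```

With $f(x) = x$ the exact answer is $X(0)e^{\mu T}$, and the Euler means are $(1+\mu h)^{n}X(0)$, so the reduction in bias can be verified by hand: for $\mu = 0.5$, $T = 1$, and four fine steps, the Euler bias is about $-0.047$ and the extrapolated bias about $-0.008$.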
Extrapolation can theoretically be applied to a second-order scheme to
further increase the order of convergence. Suppose that we start from a scheme
having weak order 2, such as the simplified scheme (6.36) or (6.38). Suppose
that in fact

$$E[f(\hat X^h(T))] = E[f(X(T))] + ch^2 + o(h^2).$$

Then

$$\begin{aligned}
&\tfrac13\left(4E[f(\hat X^h(T))] - E[f(\hat X^{2h}(T))]\right) \\
&\quad= \tfrac13\left(\{4E[f(X(T))] + 4ch^2 + o(h^2)\} - \{E[f(X(T))] + 4ch^2 + o(h^2)\}\right) \\
&\quad= E[f(X(T))] + o(h^2).
\end{aligned}$$

If the $o(h^2)$ error is in fact $O(h^3)$, then the combination

$$\tfrac{1}{21}\left(32E[f(\hat X^h(T))] - 12E[f(\hat X^{2h}(T))] + E[f(\hat X^{4h}(T))]\right)$$

eliminates that term too. Notice that the correct weights to apply to $\hat X^h$, $\hat X^{2h}$,
and any other discretization depend on the weak order of convergence of the
scheme used.


6.3 Extensions


6.3.1 General Expansions 


The derivations leading to the strong first-order scheme (6.10) and the weak 
second-order schemes (6.28) and (6.32) generalize to produce approximations 
that are, in theory, of arbitrarily high weak or strong order, under conditions 
on the coefficient functions. These higher-order methods can be cumbersome 
to implement and are of questionable practical significance; but they are of 
considerable theoretical interest and help underscore a distinction between
weak and strong approximations. 

We consider a $d$-dimensional process $X$ driven by an $m$-dimensional
standard Brownian motion $W$ through an SDE of the form

$$dX(t) = b_0(X(t))\,dt + \sum_{j=1}^{m} b_j(X(t))\,dW_j(t).$$

We have written the drift coefficient as $b_0$ rather than $a$ to allow more compact
notation in the expansions that follow. Let $\mathcal{L}^0$ be as in (6.29) but with $a_i$
replaced by $b_{0i}$; and let $\mathcal{L}^k$ be as in (6.30), $k = 1,\ldots,m$.

For any $n = 1, 2, \ldots$, and any $j_1, j_2, \ldots, j_n \in \{0, 1, \ldots, m\}$, define the
multiple integrals

$$I_{(j_1, j_2, \ldots, j_n)} = \int_t^{t+h}\cdots\int_t^{u_3}\int_t^{u_2} dW_{j_1}(u_1)\,dW_{j_2}(u_2)\cdots dW_{j_n}(u_n),$$

with the convention that $dW_0(u) = du$. These integrals generalize those used
in Section 6.2.2. Here, $t$ and $h$ are arbitrary positive numbers; as in
Section 6.2.2, our objective is to approximate an arbitrary increment from $X(t)$
to $X(t+h)$.
The general weak expansion of order $\beta = 1, 2, \ldots$ takes the form

$$X(t+h) \approx X(t) + \sum_{n=1}^{\beta}\ \sum_{j_1,\ldots,j_n} \mathcal{L}^{j_1}\cdots\mathcal{L}^{j_{n-1}} b_{j_n}\, I_{(j_1,\ldots,j_n)}, \tag{6.44}$$

with each $j_i$ ranging over $0, 1, \ldots, m$. (This approximation applies to each
coordinate of the vectors $X$ and $b_{j_n}$.) When $\beta = 2$, this reduces to the
second-order scheme in (6.32). Kloeden and Platen [211] (Section 14.5) justify the
general case and provide conditions under which this approximation produces
a scheme with weak order $\beta$.

In contrast to (6.44), the general strong expansion of order $\beta = 1/2, 1,
3/2, \ldots$ takes the form (Kloeden and Platen [211], Section 14.5)

$$X(t+h) \approx X(t) + \sum_{(j_1,\ldots,j_n)\in A_\beta} \mathcal{L}^{j_1}\cdots\mathcal{L}^{j_{n-1}} b_{j_n}\, I_{(j_1,\ldots,j_n)}. \tag{6.45}$$

The set $A_\beta$ is defined as follows. A vector of indices $(j_1,\ldots,j_n)$ is in $A_\beta$ if
either (i) the number of indices $n$ plus the number of indices that are 0 is less
than or equal to $2\beta$, or (ii) $n = \beta + \tfrac12$ and all $n$ indices are 0. Thus, when
$\beta = 1$, the weak expansion sums over $j = 0$ and $j = 1$ (the Euler scheme),
whereas the strong expansion sums over $j = 0$, $j = 1$, and $(j_1, j_2) = (1,1)$ to
get (6.10). Kloeden and Platen [211], Section 10.6, show that (6.45) indeed
results in an approximation with strong order $\beta$.

Both expansions (6.44) and (6.45) follow from repeated application of the
steps we used in (6.7) and (6.22)–(6.23). The distinction between the two
expansions can be summarized as follows: the weak expansion treats terms
$\Delta W_j$, $j \ne 0$, as having the same order as $h$, whereas the strong expansion
treats them as having order $h^{1/2}$. Thus, indices equal to zero (corresponding
to "$dt$" terms rather than "$dW_j$" terms) count double in reckoning the number
of terms to include in the strong expansion (6.45).
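
The two index sets are easy to enumerate mechanically, which helps in checking which multiple integrals a given scheme requires. The Python sketch below (function names are ours) lists the multi-indices entering the weak expansion (6.44) and the set $A_\beta$ of the strong expansion (6.45), following the counting rule just described.

```python
from fractions import Fraction
from itertools import product


def weak_index_set(beta, m):
    """Multi-indices (j_1,...,j_n), 1 <= n <= beta, each j in {0,...,m}."""
    return [idx for n in range(1, beta + 1)
            for idx in product(range(m + 1), repeat=n)]


def strong_index_set(beta, m):
    """The set A_beta for strong order beta in {1/2, 1, 3/2, ...}."""
    beta = Fraction(beta)
    two_beta = int(2 * beta)
    selected = []
    for n in range(1, two_beta + 1):
        for idx in product(range(m + 1), repeat=n):
            zeros = idx.count(0)
            counts_double = n + zeros <= two_beta            # condition (i)
            all_dt = (n == beta + Fraction(1, 2)) and zeros == n   # condition (ii)
            if counts_double or all_dt:
                selected.append(idx)
    return selected
```

For a scalar Brownian motion ($m = 1$), `strong_index_set(1, 1)` returns `[(0,), (1,), (1, 1)]`, the three terms of (6.10), while `weak_index_set(1, 1)` returns only `[(0,), (1,)]`, the Euler scheme.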


6.3.2 Jump-Diffusion Processes 


Let $\{N(t), t \ge 0\}$ be a Poisson process, let $\{Y_1, Y_2, \ldots\}$ be i.i.d. random
vectors, $W$ a standard multidimensional Brownian motion with $N$, $W$, and
$\{Y_1, Y_2, \ldots\}$ independent of each other. Consider jump-diffusion models of the
form

$$dX(t) = a(X(t-))\,dt + b(X(t-))^\top dW(t) + c(X(t-), Y_{N(t-)+1})\,dN(t). \tag{6.46}$$

Between jumps of the Poisson process, $X$ evolves like a diffusion with
coefficient functions $a$ and $b$; at the $n$th jump of the Poisson process, the jump in
$X$ is

$$X(t) - X(t-) = c(X(t-), Y_n),$$

a function of the state of $X$ just before the jump and the random variable $Y_n$.
We discussed a special case of this model in Section 3.5.1. Various processes
of this type are used to model the dynamics of underlying assets in pricing
derivative securities.

Mikulevicius and Platen [265] extend the general weak expansion (6.44) 
to processes of this type and analyze the discretization schemes that follow 
from this expansion. Their method uses a pure-diffusion discretization method 
between the jumps of the Poisson process and applies the function c to the dis- 
cretized process to determine the jump in the discretized process at a jump of 
the Poisson process. The jump magnitudes are thus computed exactly, conditional
on the value of the discretized process just before a jump. Mikulevicius
and Platen [265] show, in fact, that the weak order of convergence of this 
method equals the order of the scheme used for the pure-diffusion part, under 
conditions on the coefficient functions a, b, and c. 

In more detail, this method supplements the original time grid $0, h, 2h, \ldots$
with the jump times of the Poisson process. Because the Poisson process is
independent of the Brownian motion, we can imagine generating all of these
jump times at the start of a simulation. (See Section 3.5 for a discussion of
the simulation of Poisson processes.) Let $0 = \tau_0, \tau_1, \tau_2, \ldots$ be the combined
time grid, including both the multiples of $h$ and the Poisson jump times. The
discretization scheme proceeds by simulating $\hat X$ from $\tau_i$ to $\tau_{i+1}$, $i = 0, 1, \ldots$.
Given $\hat X(\tau_i)$, we apply an Euler scheme or higher-order scheme to generate
$\hat X(\tau_{i+1}-)$, using the coefficient functions $a$ and $b$. If $\tau_{i+1}$ is a Poisson jump
time (the $n$th, say), we generate $Y_n$ and set

$$\hat X(\tau_{i+1}) = \hat X(\tau_{i+1}-) + c(\hat X(\tau_{i+1}-), Y_n).$$

If $\tau_{i+1}$ is not a jump time, we set $\hat X(\tau_{i+1}) = \hat X(\tau_{i+1}-)$.
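
The jump-adapted procedure can be sketched in Python for a scalar process, using the Euler scheme between grid points. All names below are ours; `draw_mark` is a hypothetical callable that samples one mark $Y_n$, and $T$ is assumed to be a multiple of $h$.

```python
import math
import random


def poisson_jump_times(rate, T, rng):
    """Jump times of a Poisson process on (0, T] via exponential interarrival times."""
    times, t = [], rng.expovariate(rate)
    while t <= T:
        times.append(t)
        t += rng.expovariate(rate)
    return times


def combined_grid(h, T, jump_times):
    """Sorted union of the fixed grid 0, h, 2h, ..., T and the jump times."""
    n = int(round(T / h))
    fixed = [i * h for i in range(n)] + [T]
    return sorted(set(fixed) | set(jump_times))


def jump_adapted_euler(x0, a, b, c, draw_mark, rate, h, T, rng):
    """Simulate X(T) for dX = a dt + b dW + c(X-, Y) dN on the combined grid."""
    jumps = poisson_jump_times(rate, T, rng)
    jump_set = set(jumps)
    grid = combined_grid(h, T, jumps)
    x = x0
    for t0, t1 in zip(grid[:-1], grid[1:]):
        dt = t1 - t0
        dw = math.sqrt(dt) * rng.gauss(0.0, 1.0)
        x_minus = x + a(x) * dt + b(x) * dw            # Euler step to just before t1
        if t1 in jump_set:
            x = x_minus + c(x_minus, draw_mark(rng))   # exact jump given the pre-jump state
        else:
            x = x_minus
    return x
```

Because the jump times are generated up front, the only change relative to a pure-diffusion simulation is the irregular spacing of the grid and the extra update at jump epochs.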

Glasserman and Merener [145] apply this method to a version of the LIBOR
market model with jumps. The jump processes they consider are more
general than Poisson processes, having arrival rates that depend on the cur- 
rent level of forward rates. They extend the method to this setting by using 
a bound on the state-dependent arrival rates to construct jumps by thinning 
a Poisson process. This requires relaxing the smoothness conditions imposed 
on c in Mikulevicius and Platen [265]. 

Maghsoodi [245] provides various alternative discretization schemes for 
jump-diffusion processes and considers both strong and weak error criteria. 
He distinguishes jump-adapted methods (like the one described above) that 
include the Poisson jump epochs in the time grid from those that use a fixed 
grid. A jump-adapted method may become computationally burdensome if the 
jump intensity is very high. Protter and Talay [301] analyze the Euler scheme 
for stochastic differential equations driven by Lévy processes, which include 
(6.46) as a special case. Among other results, they provide error expansions 
in powers of h justifying the use of Richardson extrapolation.
6.3.3 Convergence of Mean Square Error


The availability of discretization schemes of various orders poses a tradeoff: 
using a higher-order scheme requires more computing time per path and thus 
reduces the number of paths that can be completed in a fixed amount of time. 
The number of paths completed affects the standard error of any estimates we 
compute, but has no effect on discretization bias, which is determined by our 
choice of scheme. Thus, we face a tradeoff between reducing bias and reducing 
variance. 

We discussed this tradeoff in a more general setting in Section 1.1.3 and 
the asymptotic conclusions reached there apply in the current setting. Here 
we present a slightly different argument to arrive at the same conclusion. 

We suppose that our objective is to minimize mean square error (MSE),
the sum of variance and squared bias. Using a discretization scheme of weak
order $\beta$, we expect

$$\text{Bias} \approx c_1 h^{\beta}$$

for some constant $c_1$. For the variance based on $n$ paths we expect

$$\text{Variance} \approx \frac{c_2}{n}$$

for some constant $c_2$. The time step $h$ would generally have some effect on
variance; think of $c_2$ as the limit as $h$ decreases to zero of the variance per
replication.

If we make the reasonable assumption that the computing time per path is
proportional to the number of steps per path, then it is inversely proportional
to $h$. The total computing time for $n$ paths is then $nc_3/h$, for some constant
$c_3$.

With these assumptions and approximations, we formulate the problem of
minimizing MSE subject to a computational budget $s$ as follows:

$$\min_{n,h}\left(c_1^2h^{2\beta} + \frac{c_2}{n}\right) \quad\text{subject to}\quad \frac{nc_3}{h} = s.$$

Using the constraint to eliminate a variable, we put this in the form

$$\min_{h}\left(c_1^2h^{2\beta} + \frac{c_2c_3}{hs}\right),$$

which is minimized at

$$h = c\,s^{-\frac{1}{2\beta+1}} \tag{6.47}$$

with $c$ a constant. Substituting this back into our expressions for the squared
bias and the variance, we get

$$\text{MSE} \approx c_1's^{-\frac{2\beta}{2\beta+1}} + c_2's^{-\frac{2\beta}{2\beta+1}} = c's^{-\frac{2\beta}{2\beta+1}},$$

for some constants $c'$, $c_1'$, $c_2'$. The optimal allocation thus balances variance
and squared bias. Also, the optimal root mean square error becomes

$$\sqrt{\text{MSE}} \propto s^{-\frac{\beta}{2\beta+1}}. \tag{6.48}$$

This is what we found in Section 1.1.3 as well.

These calculations show how the order $\beta$ of a scheme affects both the
optimal allocation of effort and the convergence rate under the optimal allocation.
As the convergence order $\beta$ increases, the optimal convergence rate in (6.48)
approaches $s^{-1/2}$, the rate associated with unbiased simulation. But for the
important cases of $\beta = 1$ (first-order) and $\beta = 2$ (second-order) we get rates of
$s^{-1/3}$ and $s^{-2/5}$. This makes precise the notion that simulating a process for
which a discretization scheme is necessary is harder than simulating a
solvable model. It also shows that when very accurate results are required (i.e.,
when $s$ is large), a higher-order scheme will ultimately dominate a lower-order
scheme.

Duffie and Glynn [100] prove a limit theorem that justifies the convergence 
rate implied by (6.48). They also report numerical results that are generally 
consistent with their theoretical predictions. 
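
The optimization behind (6.47) and (6.48) can be carried out explicitly. Setting the derivative of $c_1^2h^{2\beta} + c_2c_3/(hs)$ to zero gives $h^{2\beta+1} = c_2c_3/(2\beta c_1^2 s)$, which the following Python sketch evaluates; the function name is ours, and in practice the constants $c_1$, $c_2$, $c_3$ would have to be estimated from pilot runs.

```python
def optimal_allocation(s, beta, c1, c2, c3):
    """Minimize c1^2 h^(2 beta) + c2 / n subject to n c3 / h = s.

    Returns the optimal step h, the implied number of paths n (not rounded),
    and the resulting approximate mean square error.
    """
    h = (c2 * c3 / (2.0 * beta * c1**2 * s)) ** (1.0 / (2.0 * beta + 1.0))
    n = s * h / c3
    mse = c1**2 * h ** (2.0 * beta) + c2 / n
    return h, n, mse
```

Multiplying the budget $s$ by 8 with $\beta = 1$ halves the root mean square error, in line with the rate $s^{-1/3}$; with $\beta = 2$, a 32-fold increase in budget reduces it by a factor of four, the rate $s^{-2/5}$.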


6.4 Extremes and Barrier Crossings: Brownian 
Interpolation 


In Section 6.2.3 we showed that discretization methods can sometimes be
extended to path-dependent payoffs through supplementary state variables. The
additional state variables remove dependence on the past; standard discretization
procedures can then be applied to the augmented state vector.

In option pricing applications, path-dependence often enters through the 
maximum or minimum of an underlying asset over the life of the option. This 
includes, for example, options whose payoffs depend on whether or not an un- 
derlying asset crosses a barrier. Here, too, path-dependence can be eliminated 
by including the running maximum or minimum in the state vector. However, 
this renders standard discretization procedures inapplicable because of the 
singular dynamics of these supplementary variables. The running maximum, 
for example, can increase only when it is equal to the underlying process. 

This issue arises even when the underlying process $X$ is a standard
Brownian motion. Let

$$M(t) = \max_{0\le u\le t} X(u)$$

and let

$$\hat M^h(n) = \max\{X(0), X(h), X(2h), \ldots, X(nh)\}. \tag{6.49}$$

Then $\hat M^h(n)$ is the maximum of the Euler approximation to $X$ over $[0, nh]$;
the Euler approximation to $X$ is exact for $X$ itself because $X$ is Brownian
motion. Fix a time $T$ and let $h = T/n$ so that $\hat M^h(n)$ is the discrete-time
approximation of $M(T)$. Asmussen, Glynn, and Pitman [24] show that the
normalized error

$$h^{-1/2}\left[\hat M^h(n) - M(T)\right]$$

has a limiting distribution as $h \to 0$. This result may be paraphrased as
stating that the distribution of $\hat M^h(n)$ converges to that of $M(T)$ at rate
$h^{1/2}$. It follows that the weak order of convergence (in the sense of (6.14))
cannot be greater than 1/2. In contrast, we noted in Section 6.1.2 that for
SDEs with smooth coefficient functions the Euler scheme has weak order of
convergence 1. Thus, the singularity of the dynamics of the running maximum
leads to a slower convergence rate.

In the case of Brownian motion, this difficulty can be circumvented by
sampling $M(T)$ directly, rather than through (6.49). We can sample from
the joint distribution of $X(T)$ and $M(T)$ as follows. First we generate $X(T)$
from $N(0, T)$. Conditional on $X(T)$ the process $\{X(t), 0 \le t \le T\}$ becomes a
Brownian bridge, so we need to sample from the distribution of the maximum
of a Brownian bridge. We discussed how to do this in Example 2.2.3. Given
$X(T)$, set

$$M(T) = \frac{X(T) + \sqrt{X(T)^2 - 2T\log U}}{2}$$

with $U \sim \text{Unif}[0,1]$ independent of $X(T)$. The pair $(X(T), M(T))$ then has
the joint distribution of the terminal and maximum value of the Brownian
motion over $[0, T]$.
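
A direct transcription of this sampling procedure into Python might look as follows (the function name is ours). The uniform is drawn from $(0, 1]$ rather than $[0, 1)$ to keep the logarithm finite.

```python
import math
import random


def sample_terminal_and_max(T, rng):
    """Sample (X(T), M(T)) for standard Brownian motion started at 0."""
    x = math.sqrt(T) * rng.gauss(0.0, 1.0)
    u = 1.0 - rng.random()              # in (0, 1], avoids log(0)
    m = 0.5 * (x + math.sqrt(x * x - 2.0 * T * math.log(u)))
    return x, m
```

Two quick checks on an implementation: every sampled maximum should be at least $\max(0, X(T))$, and the sample mean of $M(T)$ should be close to $\sqrt{2T/\pi}$, the mean of $|X(T)|$, since $M(T)$ and $|X(T)|$ have the same distribution by the reflection principle.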

This procedure, exact for Brownian motion, suggests an approximation
for more general processes. Suppose $X$ is a diffusion satisfying the SDE (6.1)
with scalar coefficient functions $a$ and $b$. Let $\hat X(i) = \hat X(ih)$, $i = 0, 1, \ldots$, be
a discrete-time approximation to $X$, such as one defined through an Euler
or higher-order scheme. The simple estimate (6.49) applied to $\hat X$ is
equivalent to taking the maximum over a piecewise linear interpolation of $\hat X$. We
can expect to get a better approximation by interpolating over the interval
$[ih, (i+1)h)$ using a Brownian motion with fixed parameters $a_i = a(\hat X(i))$
and $b_i = b(\hat X(i))$. Given the endpoints $\hat X(i)$ and $\hat X(i+1)$, the maximum of
the interpolating Brownian bridge can be simulated using

$$\hat M_i = \frac{\hat X(i+1) + \hat X(i) + \sqrt{[\hat X(i+1) - \hat X(i)]^2 - 2b_i^2h\log U_i}}{2}, \tag{6.50}$$

with $U_0, U_1, \ldots$ independent Unif[0,1] random variables. (The value of $a(\hat X(i))$
becomes immaterial once we condition on $\hat X(i+1)$.) The maximum of $X$ over
$[0, T]$ can then be approximated using

$$\max\{\hat M_0, \hat M_1, \ldots, \hat M_{n-1}\}.$$

Similar ideas are suggested in Andersen and Brotherton-Ratcliffe [16] and in
Beaglehole, Dybvig, and Zhou [41] for pricing lookback options; their
numerical results indicate that the approach can be very effective. Baldi [31] analyzes
related techniques in a much more general setting.
In some applications, $X$ may be better approximated by geometric
Brownian motion than by ordinary Brownian motion. This can be accommodated
by applying (6.50) to $\log\hat X$ rather than $\hat X$. This yields

$$\log\hat M_i = \frac{\log(\hat X(i+1)\hat X(i)) + \sqrt{[\log(\hat X(i+1)/\hat X(i))]^2 - 2(b_i/\hat X(i))^2h\log U_i}}{2},$$

and exponentiating produces $\hat M_i$.


Barrier Crossings 


Similar ideas apply in pricing barrier options with continuously monitored
barriers. Suppose $B > X(0)$ and let

$$\tau = \inf\{t \ge 0 : X(t) > B\}.$$

A knock-out option might have a payoff of the form

$$(K - X(T))^+\,\mathbf{1}\{\tau > T\}, \tag{6.51}$$

with $K$ a constant. This requires simulation of $X(T)$ and the indicator
$\mathbf{1}\{\tau > T\}$.

The simplest method sets

$$\hat\tau = \inf\{i : \hat X(i) > B\}$$

and approximates $(X(T), \mathbf{1}\{\tau > T\})$ by $(\hat X(n), \mathbf{1}\{\hat\tau > n\})$ with $h = T/n$, for
some discretization $\hat X$. But even if we could simulate $X$ exactly on the discrete
grid $0, h, 2h, \ldots$, this would not sample $\mathbf{1}\{\tau > T\}$ exactly: it is possible for $X$
to cross the barrier at some time $t$ between grid points $ih$ and $(i+1)h$ and
never be above the barrier at any of the dates $0, h, 2h, \ldots$.

The method in (6.50) can be used to reduce discretization error in sampling
the survival indicator $\mathbf{1}\{\tau > T\}$. Observe that the barrier is crossed in the
interval $[ih, (i+1)h)$ precisely if the maximum over this interval exceeds $B$.
Hence, we can approximate the survival indicator $\mathbf{1}\{\tau > T\}$ using

$$\prod_{i=0}^{n-1}\mathbf{1}\{\hat M_i \le B\}, \tag{6.52}$$

with $nh = T$ and $\hat M_i$ as in (6.50).

This method can be simplified. Rather than generate $\hat M_i$, we can sample
the indicators $\mathbf{1}\{\hat M_i \le B\}$ directly. Given $\hat X(i)$ and $\hat X(i+1)$, this indicator
takes the value 1 with probability

$$\hat p_i = P(\hat M_i \le B \mid \hat X(i), \hat X(i+1)) = 1 - \exp\left(-\frac{2(B - \hat X(i))(B - \hat X(i+1))}{b(\hat X(i))^2h}\right)$$

(assuming $B$ is greater than both $\hat X(i)$ and $\hat X(i+1)$) and it takes the value
0 with probability $1 - \hat p_i$. Thus, we can approximate $\mathbf{1}\{\tau > T\}$ using

$$\prod_{i=0}^{n-1}\mathbf{1}\{U_i \le \hat p_i\}.$$

For independent uniform $U_0, U_1, \ldots, U_{n-1}$, this has the same distribution as (6.52)
(and the same value, if each $U_i$ in (6.50) is replaced by $1 - U_i$) but is slightly
simpler to evaluate. The probabilities $\hat p_i$ could alternatively be computed based
on an approximating geometric (rather than ordinary) Brownian motion.

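The discretized process $\hat X$ is often a Markov process and this leads to
further simplification. Consider, for example, the payoff in (6.51). Using (6.52),
we approximate the payoff as

$$(K - \hat X(n))^+\prod_{i=0}^{n-1}\mathbf{1}\{\hat M_i \le B\}. \tag{6.53}$$

The conditional expectation of this expression given the values of $\hat X$ is

$$\begin{aligned}
&E\left[(K - \hat X(n))^+\prod_{i=0}^{n-1}\mathbf{1}\{\hat M_i \le B\}\,\Big|\,\hat X(0), \hat X(1), \ldots, \hat X(n)\right] &(6.54)\\
&\quad= (K - \hat X(n))^+\prod_{i=0}^{n-1}E\left[\mathbf{1}\{\hat M_i \le B\}\mid\hat X(i), \hat X(i+1)\right] \\
&\quad= (K - \hat X(n))^+\prod_{i=0}^{n-1}\hat p_i. &(6.55)
\end{aligned}$$

Thus, rather than generate the barrier-crossing indicators, we can just
multiply by the probabilities $\hat p_i$.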
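
The two estimators (6.53) and (6.55) can be compared on the same simulated paths. The Python sketch below does so for a driftless Brownian motion with constant volatility $b$, a choice made for illustration because both the grid values and the bridge probabilities are then exact; the function names and parameter values are assumptions.

```python
import math
import random


def survival_probability(x0, x1, barrier, b, h):
    """p_i = P(max of the interpolating bridge <= barrier | endpoints x0, x1)."""
    if x0 >= barrier or x1 >= barrier:
        return 0.0
    return 1.0 - math.exp(-2.0 * (barrier - x0) * (barrier - x1) / (b * b * h))


def knock_out_put_samples(x0, b, K, barrier, T, n, paths, seed=0):
    """Per-path payoffs: (indicator estimator (6.53), conditional estimator (6.55))."""
    rng = random.Random(seed)
    h = T / n
    sqrt_h = math.sqrt(h)
    with_indicators, with_probabilities = [], []
    for _ in range(paths):
        x, alive, prob = x0, 1.0, 1.0
        for _ in range(n):
            x_next = x + b * sqrt_h * rng.gauss(0.0, 1.0)   # driftless Brownian step
            p = survival_probability(x, x_next, barrier, b, h)
            if rng.random() >= p:       # survive this interval with probability p
                alive = 0.0
            prob *= p
            x = x_next
        payoff = max(K - x, 0.0)
        with_indicators.append(payoff * alive)
        with_probabilities.append(payoff * prob)
    return with_indicators, with_probabilities
```

For simplicity the sketch simulates every path to the end under both estimators, so it compares their variances but not their computing times. In this driftless example with $K = B$, the reflection principle gives the exact value $B - X(0)$ for the expected payoff, which provides a check on both sample means.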

Because (6.55) is the conditional expectation of (6.53), the two have the 
same expectation and thus the same discretization bias. By Jensen’s inequal- 
ity, the second moment of (6.53) is larger than the second moment of its condi- 
tional expectation (6.55), so using (6.55) rather than (6.53) reduces variance. 
(This is an instance of a more general strategy for reducing variance known
as conditional Monte Carlo, based on replacing an estimator with its
conditional expectation; see Boyle et al. [53] for other applications in finance.)
Using (6.53), we would stop simulating a path once some $\hat M_i$ exceeds $B$. Using
(6.55), we never generate the $\hat M_i$ and must therefore simulate every path for
$n$ steps, unless some $\hat X(i)$ exceeds $B$ (in which case $\hat p_i = 0$). So, although
(6.55) has lower variance, it requires greater computational effort per path.
A closely related tradeoff is investigated by Glasserman and Staum [146]; 
they consider estimators in which each transition of an underlying asset is 
sampled conditional on not crossing a barrier. In their setting, products of 
survival probabilities like those in (6.55) serve as likelihood ratios relating the 
conditional and unconditional evolution of the process. 
Baldi, Caramellino, and Iovino [32] develop methods for reducing
discretization error in a general class of barrier option simulation problems. They
consider single- and double-barrier options with time-varying barriers and
develop approximations to the one-step survival probabilities that refine the $\hat p_i$
above. The $\hat p_i$ are based on a single constant barrier and a Brownian
approximation over a time interval of length $h$. Baldi et al. [32] derive asymptotics of
the survival probabilities as $h \to 0$ for quite general diffusions based, in part,
on a linear approximation to upper and lower barriers.


Averages Revisited 


As already noted, the simulation estimators based on (6.50) or (6.52) can be
viewed as the result of using Brownian motion to interpolate between the
points $\hat{X}(i)$ and $\hat{X}(i+1)$ in a discretization scheme. The same idea
can be applied in simulating other path-dependent quantities besides extremes
and barrier-crossing indicators.

As an example, consider simulation of the pair

$$\left( X(T),\ \int_0^T X(t)\,dt \right)$$


for some scalar diffusion $X$. In Section 6.2.3, we suggested treating this pair
as the state at time $T$ of a bivariate diffusion and applying a discretization
method to this augmented process. An alternative simulates a discretization
$\hat{X}(i)$, $i = 0, 1, \ldots, n$, and uses Brownian interpolation to
approximate the integral. More explicitly, the approximation is


with $b_i = b(\hat{X}(i))$ and each $A_i$ sampled from the distribution of

$$\int_t^{t+h} W(u)\,du, \qquad W \sim \mathrm{BM}(0,1),$$

conditional on $W(t) = \hat{X}(i)$ and $W(t+h) = \hat{X}(i+1)$. The calculations used
to derive (6. 7, show that this conditional distribution is normal with mean 
$h(\hat{X}(i+1) + \hat{X}(i))/2$ and variance $h^3/3$. Thus, the $A_i$ are easily generated.
This leads to a discretization scheme only slightly different from the one
arrived at through the approach in Section 6.2.3. Because of the relative
smoothness of the running integral, the effect of Brownian interpolation in this
setting is minor compared to the benefit in simulating extremes or barrier
crossings.


6.5 Changing Variables


We conclude our discussion of discretization methods by considering the
flexibility to change variables through invertible transformations of a process.
If $X$ is a $d$-dimensional diffusion and $g : \mathbb{R}^d \to \mathbb{R}^d$ is a
smooth, invertible transformation, we can define a process $Y(t) = g(X(t))$,
simulate a discretization $\hat{Y}$, and define $\hat{X} = g^{-1}(\hat{Y})$ to get
a discretization of the original process $X$. Thus, even if we restrict ourselves
to a particular discretization method (an Euler or higher-order scheme), we have
a great deal of flexibility in how we implement it. Changing variables has the
potential to reduce bias and can also be useful in enforcing restrictions (such
as nonnegativity) on simulated values. There is little theory available to guide
such transformations; we illustrate the idea with some examples.


Taking Logarithms 


Many of the stochastic processes that arise in mathematical finance take only
positive values. This property often results from specifying that the diffusion
term be proportional to the current level of the process, as in geometric
Brownian motion and in the LIBOR market model of Section 3.7. If the
coordinates of a $d$-dimensional process $X$ are positive, we may define
$Y_i(t) = \log X_i(t)$, $i = 1, \ldots, d$, apply Itô’s formula to derive an SDE
satisfied by $Y = (Y_1, \ldots, Y_d)$, simulate a discretization $\hat{Y}$ of $Y$,
and then (if they are needed) approximate the original $X_i$ with
$\hat{X}_i = \exp(\hat{Y}_i)$. We encountered this idea in Section 3.7.3 in the
setting of the LIBOR market model.

Applying a logarithmic transformation can have several benefits. First, it
ensures that the simulated $\hat{X}_i$ are positive because they result from
exponentiation, whereas even a high-order scheme applied directly to the
dynamics of $X$ will produce some negative values. Keeping the variables positive
can be important if the variables represent asset prices or interest rates.

Second, a logarithmic transformation can enhance the numerical stability 
of a discretization method, meaning that it can reduce the propagation of 
round-off error. A process with “additive noise” can generally be simulated 
with less numerical error than a process with “multiplicative noise.” Numerical 
stability is discussed in greater detail in Kloeden and Platen [211]. 

Third, a logarithmic transformation can reduce discretization bias. For
example, an Euler scheme applied to geometric Brownian motion becomes
exact if we first take logarithms. More generally, if the coefficients of a
diffusion are nearly proportional to the level of the process, then the
coefficients of the log process are nearly constant.
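
The exactness claim for geometric Brownian motion can be checked directly. The following is a minimal sketch rather than anything taken from the text: the function name `gbm_terminal_values` and all parameter values are illustrative assumptions. It drives an Euler scheme for $S$ and an Euler scheme for $\log S$ with the same normal increments and returns both alongside the exact solution.

```python
import math
import random


def gbm_terminal_values(s0, mu, sigma, T, n_steps, rng):
    """Simulate one path of dS = mu*S dt + sigma*S dW three ways.

    Returns (Euler on S, Euler on log S, exact S(T)), all driven by the
    same Brownian increments.
    """
    h = T / n_steps
    s_euler = s0
    y_euler = math.log(s0)
    w_total = 0.0  # running value of W(T)
    for _ in range(n_steps):
        dw = math.sqrt(h) * rng.gauss(0.0, 1.0)
        # Euler step applied directly to S
        s_euler += mu * s_euler * h + sigma * s_euler * dw
        # Euler step applied to Y = log S, whose coefficients are constant
        y_euler += (mu - 0.5 * sigma**2) * h + sigma * dw
        w_total += dw
    exact = s0 * math.exp((mu - 0.5 * sigma**2) * T + sigma * w_total)
    return s_euler, math.exp(y_euler), exact


rng = random.Random(12345)
direct, via_log, exact = gbm_terminal_values(100.0, 0.05, 0.30, 1.0, 4, rng)
print(direct, via_log, exact)
```

With any number of steps, the value obtained through logarithms should agree with the exact solution up to rounding, whereas the direct Euler value carries a discretization error and can in principle become negative when the steps are large.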

This idea is illustrated in Figure 6.3, which is similar to examples in
Glasserman and Merener [145] and is based on numerical results obtained
by Nicolas Merener. The figure shows estimated biases in pricing a six-month
caplet maturing in 20 years using the LIBOR market model with the decreasing
volatility parameters used in Table 4.2. The largest practical time step in
this setting is the length of the accrual period; the figure compares methods
using one, two, and four steps per accrual period with between four and 20
million replications per method. In this example, taking logarithms cuts the
absolute bias roughly in half for the Euler scheme. The figure also shows
results using a second-order method for rates (×) and log rates (∘); even with
10 million replications, the mean errors in these methods are not statistically
distinguishable from zero. In a LIBOR market model with jumps, Glasserman
and Merener [145] find experimentally that a first-order scheme applied to log
rates is as accurate as a second-order scheme applied to the rates themselves.

Fig. 6.3. Estimated bias versus number of steps in caplet pricing. The × and ∘
correspond to second-order schemes for rates and log rates, respectively.


As an aside, we note that the most time-consuming part of simulating
an Euler approximation to a LIBOR market model is evaluating the drift
coefficient. To save computing time, one might use the same value of the drift
for multiple time steps, along the lines of Example 4.1.4, but with occasional
updating of the drift. A similar idea is found to be effective in Hunter et
al. [192] as part of a predictor-corrector method (of the type discussed in
Chapter 15 of Kloeden and Platen [211]).


Imposing Upper and Lower Bounds 


Taking logarithms enforces a lower bound at zero on a discretized process.
Suppose the coordinates of $X$ are known to evolve in the interval (0,1). How
could this be imposed on a discretization of $X$? Enforcing such a condition
could be important if, for example, the coordinates of $X$ are zero-coupon
bond prices, which should always be positive and should never exceed their
face value of 1.

Glasserman and Wang [149] consider the transformations

$$Y_i = \Phi^{-1}(X_i) \quad\text{and}\quad Y_i = \log\left(\frac{X_i}{1 - X_i}\right), \tag{6.56}$$


both of which are increasing functions from (0,1) onto the real line. Itô’s
formula produces an SDE for $Y$ from an SDE for $X$ in either case; the
resulting SDE can be discretized and then the inverse transformations applied to
produce

$$\hat{X}_i = \Phi(\hat{Y}_i) \quad\text{or}\quad \hat{X}_i = \frac{\exp(\hat{Y}_i)}{1 + \exp(\hat{Y}_i)}.$$
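
As a concrete illustration of the logistic transformation in (6.56), the sketch below assumes a hypothetical driftless SDE $dX = \sigma X(1-X)\,dW$ on (0,1), which is not a model from the text. For this choice, Itô’s formula gives $dY = \tfrac{1}{2}\sigma^2(2X - 1)\,dt + \sigma\,dW$ for $Y = \log(X/(1-X))$. The function names and parameter values are illustrative.

```python
import math
import random


def logistic(y):
    """Inverse of y = log(x / (1 - x)); maps the real line into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-y))


def simulate_bounded_path(x0, sigma, T, n_steps, rng):
    """Euler scheme for Y = log(X/(1-X)), mapped back to X after each step.

    Assumed dynamics (hypothetical): dX = sigma*X*(1-X) dW, for which
    dY = 0.5*sigma^2*(2X - 1) dt + sigma dW.
    """
    h = T / n_steps
    y = math.log(x0 / (1.0 - x0))
    path = [x0]
    for _ in range(n_steps):
        x = logistic(y)
        dw = math.sqrt(h) * rng.gauss(0.0, 1.0)
        y += 0.5 * sigma**2 * (2.0 * x - 1.0) * h + sigma * dw
        path.append(logistic(y))
    return path


rng = random.Random(2024)
paths = [simulate_bounded_path(0.9, 1.5, 1.0, 4, rng) for _ in range(1000)]
print(min(min(p) for p in paths), max(max(p) for p in paths))
```

Because every simulated value is the image of a real number under the logistic map, the bounds hold by construction, even with only four time steps. An Euler scheme applied directly to $X$ offers no such guarantee.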


If the coordinates of $X$ correspond to bonds of increasing maturities, we
may also want to enforce an ordering of the form

$$1 \ge X_1(t) \ge X_2(t) \ge \cdots \ge X_d(t) \ge 0 \tag{6.57}$$

in the discretization. The method in [149] accomplishes this by defining
$Y_1 = g(X_1)$, $Y_i = g(X_i / X_{i-1})$, $i = 2, \ldots, d$, with $g$ an increasing
function from (0,1) to $\mathbb{R}$ as in (6.56). Itô’s formula gives the dynamics
of $(Y_1, \ldots, Y_d)$ and then a discretization $(\hat{Y}_1, \ldots, \hat{Y}_d)$;
applying the inverse of $g$ produces a discretization
$(\hat{X}_1, \ldots, \hat{X}_d)$ satisfying (6.57).


Constant Diffusion Transformation 


Consider the SDE (6.1) in the case of scalar $X$ and $W$. Suppose there exists
an invertible, twice continuously differentiable transformation
$g : \mathbb{R} \to \mathbb{R}$ for which $g'(x) = 1/b(x)$. With $Y(t) = g(X(t))$,
Itô’s formula gives

$$\begin{aligned}
dY(t) &= \left[ a(X(t))\,g'(X(t)) + \tfrac{1}{2} b^2(X(t))\,g''(X(t)) \right] dt + g'(X(t))\,b(X(t))\,dW(t) \\
      &= \tilde{a}(Y(t))\,dt + dW(t),
\end{aligned}$$

with $\tilde{a}(y) = a(f(y))\,g'(f(y)) + \tfrac{1}{2} b^2(f(y))\,g''(f(y))$ and $f$ the
inverse of $g$. Changing variables from $X$ to $Y$ thus produces an SDE with a
constant diffusion coefficient. This is potentially useful in reducing
discretization error, though the impact on the drift cannot be disregarded.
Moving all state-dependence from the diffusion to the drift is attractive in
combination with a predictor-corrector method, which improves accuracy by
averaging current and future levels of the drift coefficient.

To illustrate the constant diffusion transformation we apply it to the
square-root diffusion

$$dX(t) = \alpha(\bar{x} - X(t))\,dt + \sigma\sqrt{X(t)}\,dW(t).$$

Let $Y(t) = 2\sqrt{X(t)}/\sigma$; then $Y$ is a Bessel-like process,

$$dY(t) = \left[ \frac{4\alpha\bar{x} - \sigma^2}{2\sigma^2}\,\frac{1}{Y(t)} - \frac{\alpha}{2}\,Y(t) \right] dt + dW(t).$$

In Section 3.4 we imposed the condition $2\alpha\bar{x} \ge \sigma^2$, and this is
precisely the condition required to ensure that $Y$ never reaches zero if
$Y(0) > 0$.
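
The drift of the transformed process can be checked numerically against the general formula $a(f(y))g'(f(y)) + \tfrac{1}{2}b^2(f(y))g''(f(y))$. The sketch below writes the square-root diffusion as $dX = \alpha(\bar{x} - X)\,dt + \sigma\sqrt{X}\,dW$, so that `alpha`, `xbar`, and `sigma` stand for the mean-reversion speed, long-run level, and volatility. The function names and parameter values are assumptions chosen for illustration.

```python
import math


def drift_from_ito(y, alpha, xbar, sigma):
    """Evaluate a(f(y)) g'(f(y)) + 0.5 b(f(y))^2 g''(f(y)) for g(x) = 2*sqrt(x)/sigma."""
    x = (0.5 * sigma * y) ** 2            # f(y), the inverse of g
    a = alpha * (xbar - x)                # drift of X
    b = sigma * math.sqrt(x)              # diffusion coefficient of X
    g1 = 1.0 / (sigma * math.sqrt(x))     # g'(x) = 1 / b(x)
    g2 = -0.5 / (sigma * x ** 1.5)        # g''(x)
    return a * g1 + 0.5 * b * b * g2


def drift_closed_form(y, alpha, xbar, sigma):
    """Drift of Y = 2*sqrt(X)/sigma after simplification."""
    return (4.0 * alpha * xbar - sigma**2) / (2.0 * sigma**2 * y) - 0.5 * alpha * y


for y in (0.5, 1.0, 3.0):
    print(y, drift_from_ito(y, 0.2, 0.05, 0.1), drift_closed_form(y, 0.2, 0.05, 0.1))
```

The two functions should agree up to rounding for any $y > 0$. The coefficient of $1/y$ in the closed form is at least $1/2$ exactly when $2\alpha\bar{x} \ge \sigma^2$, which is the boundary condition mentioned above.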

Now suppose $X$ and $W$ take values in $\mathbb{R}^d$ and suppose the $d \times d$
matrix $b(x)$ has inverse $c(x)$. Aït-Sahalia [7] shows that there is an invertible
transformation $g : \mathbb{R}^d \to \mathbb{R}^d$ such that the diffusion matrix of
$Y(t) = g(X(t))$ is the identity if and only if

$$\frac{\partial c_{ij}}{\partial x_k} = \frac{\partial c_{ik}}{\partial x_j}$$

for all $i, j, k = 1, \ldots, d$. This condition implies the commutativity condition
(6.33). It follows that processes $X$ that can be transformed to have a constant
diffusion matrix are also processes for which a second-order scheme is
comparatively easy to implement.


Martingale Discretization 


The requirement that discounted asset prices be martingales is central to the 
pricing of derivative securities. The martingale property is usually imposed in 
a continuous-time model; but as we have noted at several points, it is desirable 
to enforce the property on the simulated approximation as well. Doing so 
extends the no-arbitrage property to prices computed from a simulation. It 
also preserves the internal consistency of a simulated model by ensuring that 
prices of underlying assets computed in a simulation coincide with those used 
as inputs to the simulation; cf. the discussions in Sections 3.3.2, 3.6.2, 3.7.3, 
Example 4.1.3, and Section 4.5. 

If the SDE (6.1) has drift coefficient $a$ identically equal to zero, then the
Euler and higher-order methods in Sections 6.1.1 and 6.2 all produce
discretizations $\hat{X}$ that are discrete-time martingales. In this sense, the
martingale property is almost trivially preserved by the discretization methods.
However, the variables simulated do not always coincide with the variables to
which the martingale property applies. For example, in pricing fixed-income
derivatives we often simulate interest rates but the martingale property applies
to discounted bonds and not to the interest rates themselves.

The simulation method developed for the Heath-Jarrow-Morton frame- 
work in Section 3.6.2 may be viewed as a nonstandard Euler scheme — non- 
standard because of the modified drift coefficient. We modified the drift in 
forward rates precisely so that the discretized discounted bond prices would 
be martingales. In the LIBOR market models of Section 3.7, we argued that 
an analogous drift modification is infeasible. Instead, in Section 3.7.3 we dis- 
cussed ways of preserving the martingale property through changes of vari- 
ables. These methods include discretizing the discounted bond prices directly 
or discretizing their differences. 

Preserving the martingale property is sometimes at odds with preserving
bounds on variables. If $X$ is a positive martingale then the Euler
approximation to $X$ is a martingale but not, in general, a positive process.
Both properties can usually be enforced by discretizing $\log X$ instead. If $X$
is a martingale taking values in (0,1), then defining $Y$ as in (6.56),
discretizing $Y$, and then inverting the transformation preserves the bounds on
$X$ but not the martingale property. Using the transformation $\Phi^{-1}$, a simple
correction to the drift preserves the martingale property as well as the bounds.
As shown in [149], when applied to the multidimensional constraint (6.57), this
method preserves the constraints and limits departures from the martingale
property to terms that are $o(h)$, with $h$ the simulation time step.


6.6 Concluding Remarks 


The benchmark method for reducing discretization error is the combination of 
the Euler scheme with two-point extrapolation, as in Section 6.2.4. This tech- 
nique is easy to implement and is usually faster than a second-order expansion 
for comparable accuracy, especially in high dimensions. The smoothness con- 
ditions required on the coefficient functions to ensure the validity of extrapo- 
lation are not always satisfied in financial engineering applications, but similar 
conditions underlie the theoretical support for second-order approximations. 

Second-order and higher-order schemes do have practical applications, but 
they should always be compared with the simpler alternative of an extrapo- 
lated Euler approximation. A theoretical comparison is usually difficult, but 
the magnitudes of the derivatives of the coefficient functions can provide some 
indication, with large derivatives unfavorable for higher-order methods. Ta- 
lay and Tubaro [343] analyze specific models for which they derive explicit 
expressions for error terms. They show, for example, that the refinement in 
(6.10) can produce larger errors than an Euler approximation. 

Of the many expansions discussed in this chapter, the most useful (in 
addition to the extrapolated Euler scheme) are (6.28) for scalar processes and 
(6.38) for vector processes. 

For specific applications in financial engineering, a technique that uses 
information about the problem context is often more effective than a general- 
purpose expansion. For example, even a simple logarithmic change of vari- 
ables can reduce discretization error, with the added benefit of ensuring that 
positive variables stay positive. The methods of Section 6.4 provide another 
illustration: they offer simple and effective ways of reducing discretization er- 
ror in pricing options sensitive to barrier crossings and extreme values of the 
underlying assets. The simulation methods in Sections 3.6 and 3.7 also address 
discretization issues for specific models. 

The literature on discretization methods for stochastic differential equa- 
tions offers many modifications of the basic schemes discussed in Sections 6.1.1, 
6.2, and 6.3.1. Kloeden and Platen [211] provide a comprehensive treatment 
of these methods. Another direction of research investigates the law of the 
error of a discretization method, as in Jacod and Protter [193]. 


This chapter has discussed the use of second-order and higher-order expan- 
sions in simulation, but the same techniques are sometimes useful in deriving 
approximations without recourse to simulation. 


7 Estimating Sensitivities


Previous chapters have addressed various aspects of estimating expectations 
with a view toward computing the prices of derivative securities. This chapter 
develops methods for estimating sensitivities of expectations, in particular 
the derivatives of derivative prices commonly referred to as “Greeks.” From 
the discussion in Section 1.2.1, we know that in an idealized setting of con- 
tinuous trading in a complete market, the payoff of a contingent claim can 
be manufactured (or hedged) through trading in underlying assets. The risk 
in a short position in an option, for example, is offset by a delta-hedging 
strategy of holding delta units of each underlying asset, where delta is simply 
the partial derivative of the option price with respect to the current price of 
that underlying asset. Implementation of the strategy requires knowledge of 
these price sensitivities; sensitivities with respect to other parameters are also 
widely used to measure and manage risk. Whereas the prices themselves can
often be observed in the market, their sensitivities cannot, so accurate
calculation of sensitivities is arguably even more important than calculation of
prices. We will see, however, that derivative estimation presents both
theoretical and practical challenges to Monte Carlo simulation.

The methods for estimating sensitivities discussed in this chapter fall into
two broad categories: methods that involve simulating at two or more values
of the parameter of differentiation and methods that do not. Methods in the
first category, finite-difference approximations, are at least superficially
easier to understand and implement; but because they produce biased estimates,
their use requires balancing bias and variance. Methods in the second category,
when applicable, produce unbiased estimates. They accomplish this by using
information about the simulated stochastic process to replace numerical
differentiation with exact calculations. The pathwise method differentiates each
simulated outcome with respect to the parameter of interest; the likelihood
ratio method differentiates a probability density rather than an outcome.


7.1 Finite-Difference Approximations


Consider a model that depends on a parameter $\theta$ ranging over some interval
of the real line. Suppose that for each value of $\theta$ we have a mechanism for
generating a random variable $Y(\theta)$, representing the output of the model at
parameter $\theta$. Let

$$\alpha(\theta) = E[Y(\theta)].$$

The derivative estimation problem consists of finding a way to estimate
$\alpha'(\theta)$, the derivative of $\alpha$ with respect to $\theta$.

In the application to option pricing, $Y(\theta)$ is the discounted payoff of an
option, $\alpha(\theta)$ is its price, and $\theta$ could be any of the many model or
market parameters that influence the price. When $\theta$ is the initial price of an
underlying asset, then $\alpha'(\theta)$ is the option’s delta (with respect to that
asset). The second derivative $\alpha''(\theta)$ is the option’s gamma. When $\theta$
is a volatility parameter, $\alpha'(\theta)$ is often called “vega.” For interest rate
derivatives, the sensitivities of prices to the initial term structure (as
represented by, e.g., a yield curve or forward curve) are important.


7.1.1 Bias and Variance 


An obvious approach to derivative estimation proceeds as follows. Simulate
independent replications $Y_1(\theta), \ldots, Y_n(\theta)$ of the model at parameter
$\theta$ and $n$ additional replications $Y_1(\theta+h), \ldots, Y_n(\theta+h)$ at
$\theta + h$, for some $h > 0$. Average each set of replications to get
$\bar{Y}_n(\theta)$ and $\bar{Y}_n(\theta+h)$ and form the forward-difference
estimator

$$\hat{\Delta}_F \equiv \hat{\Delta}_F(n,h) = \frac{\bar{Y}_n(\theta+h) - \bar{Y}_n(\theta)}{h}. \tag{7.1}$$

This estimator has expectation

$$E[\hat{\Delta}_F] = h^{-1}[\alpha(\theta+h) - \alpha(\theta)]. \tag{7.2}$$


We have not specified what relation, if any, holds between the outcomes
$Y_i(\theta)$ and $Y_i(\theta+h)$; this will be important when we consider variance,
but (7.2) is purely a property of the marginal distributions of the outcomes at
$\theta$ and $\theta + h$.

If $\alpha$ is twice differentiable at $\theta$, then

$$\alpha(\theta+h) = \alpha(\theta) + \alpha'(\theta)h + \tfrac{1}{2}\alpha''(\theta)h^2 + o(h^2).$$


In this case, it follows from (7.2) that the bias in the forward-difference
estimator is

$$\mathrm{Bias}(\hat{\Delta}_F) = E[\hat{\Delta}_F - \alpha'(\theta)] = \tfrac{1}{2}\alpha''(\theta)h + o(h). \tag{7.3}$$


By simulating at $\theta - h$ and $\theta + h$, we can form a central-difference
estimator

$$\hat{\Delta}_C \equiv \hat{\Delta}_C(n,h) = \frac{\bar{Y}_n(\theta+h) - \bar{Y}_n(\theta-h)}{2h}. \tag{7.4}$$

This estimator is often more costly than the forward-difference estimator in
the following sense. If we are ultimately interested in estimating $\alpha(\theta)$
and $\alpha'(\theta)$, then we would ordinarily simulate at $\theta$ to estimate
$\alpha(\theta)$, so the forward-difference estimator requires simulating at just one
additional point $\theta + h$, whereas the central-difference estimator requires
simulating at two additional points. But this additional computational effort
yields an improvement in the convergence rate of the bias. If $\alpha$ is at least
twice differentiable in a neighborhood of $\theta$, then

$$\begin{aligned}
\alpha(\theta+h) &= \alpha(\theta) + \alpha'(\theta)h + \alpha''(\theta)h^2/2 + o(h^2), \\
\alpha(\theta-h) &= \alpha(\theta) - \alpha'(\theta)h + \alpha''(\theta)h^2/2 + o(h^2),
\end{aligned}$$

so subtraction eliminates the second-order terms, leaving

$$\mathrm{Bias}(\hat{\Delta}_C) = \frac{\alpha(\theta+h) - \alpha(\theta-h)}{2h} - \alpha'(\theta) = o(h), \tag{7.5}$$

which is of smaller order than (7.3). If $\alpha''$ is itself differentiable at
$\theta$, we can refine (7.5) to

$$\mathrm{Bias}(\hat{\Delta}_C) = \tfrac{1}{6}\alpha'''(\theta)h^2 + o(h^2). \tag{7.6}$$

The superior accuracy of a central-difference approximation compared 
with a forward-difference approximation is illustrated in Figure 7.1. The curve 
in the figure plots the Black-Scholes formula against the price of the underly- 
ing asset with volatility 0.30, an interest rate of 5%, a strike price of 100, and 
0.04 years (about two weeks) to expiration. The figure compares the tangent 
line at 95 with a forward difference calculated from prices at 95 and 100 and a 
central difference using prices at 90 and 100. The slope of the central-difference 
line is clearly much closer to that of the tangent line. 
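
The slopes compared in Figure 7.1 should be reproducible from the Black-Scholes formula with the parameters quoted above. The sketch below is an illustration under that assumption; the helper names `norm_cdf`, `bs_call`, and `bs_delta` are not from the text.

```python
import math


def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def bs_call(s, k, r, sigma, t):
    """Black-Scholes price of a European call."""
    d1 = (math.log(s / k) + (r + 0.5 * sigma**2) * t) / (sigma * math.sqrt(t))
    d2 = d1 - sigma * math.sqrt(t)
    return s * norm_cdf(d1) - k * math.exp(-r * t) * norm_cdf(d2)


def bs_delta(s, k, r, sigma, t):
    """Exact derivative of the call price with respect to s."""
    d1 = (math.log(s / k) + (r + 0.5 * sigma**2) * t) / (sigma * math.sqrt(t))
    return norm_cdf(d1)


k, r, sigma, t = 100.0, 0.05, 0.30, 0.04
s, h = 95.0, 5.0

tangent = bs_delta(s, k, r, sigma, t)
forward = (bs_call(s + h, k, r, sigma, t) - bs_call(s, k, r, sigma, t)) / h
central = (bs_call(s + h, k, r, sigma, t) - bs_call(s - h, k, r, sigma, t)) / (2.0 * h)

print("tangent slope:", tangent)
print("forward difference:", forward)
print("central difference:", central)
```

No simulation is involved here, so the comparison isolates the bias terms (7.3) and (7.6) at the deliberately large increment $h = 5$.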

In numerical differentiation of functions evaluated through deterministic 
algorithms, rounding errors resulting from small values of h limit the accuracy 
of finite-difference approximations. (See, for example, the discussion in Sec- 
tion 5.7 of Press et al. [299].) In applications of Monte Carlo, the variability in 
estimates of function values usually prevents us from taking very small values 
of h. So while it is advisable to be aware of possible round-off errors, this is 
seldom the main obstacle to accurate estimation of derivatives in simulation. 


Variance 


The form of the bias of the forward- and central-difference estimators would 
lead us to take ever smaller values of h to improve accuracy, at least if we 
ignore the limits of machine precision. But the effect of h on bias must be 
weighed against its effect on variance. 

Fig. 7.1. Comparison of central-difference approximation (dashed) and
forward-difference approximation (dotted) with exact tangent to the
Black-Scholes formula. (Axes: Black-Scholes option price versus underlying
asset price.)

The variance of the forward-difference estimator (7.1) is

$$\mathrm{Var}[\hat{\Delta}_F(n,h)] = h^{-2}\,\mathrm{Var}\left[\bar{Y}_n(\theta+h) - \bar{Y}_n(\theta)\right] \tag{7.7}$$


and a corresponding expression holds for the central-difference estimator (7.4).
In both cases, the factor $h^{-2}$ alerts us to possibly disastrous consequences of
taking $h$ to be very small. Equation (7.7) also makes clear that the dependence
between values simulated at different values of $\theta$ affects the variance of a
finite-difference estimator.

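Suppose, for simplicity, that the pairs $(Y(\theta), Y(\theta+h))$ and
$(Y_i(\theta), Y_i(\theta+h))$, $i = 1, 2, \ldots$, are i.i.d. so that

$$\mathrm{Var}\left[\bar{Y}_n(\theta+h) - \bar{Y}_n(\theta)\right] = \frac{1}{n}\,\mathrm{Var}\left[Y(\theta+h) - Y(\theta)\right].$$

How the variance in (7.7) changes with $h$ is determined by the dependence of
$\mathrm{Var}[Y(\theta+h) - Y(\theta)]$ on $h$.
Three cases of primary importance arise in practice:

$$\mathrm{Var}[Y(\theta+h) - Y(\theta)] =
\begin{cases}
O(1), & \text{Case (i)} \\
O(h), & \text{Case (ii)} \\
O(h^2), & \text{Case (iii).}
\end{cases} \tag{7.8}$$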


Case (i) applies if we simulate $Y(\theta)$ and $Y(\theta+h)$ independently; for in
this case we have

$$\mathrm{Var}[Y(\theta+h) - Y(\theta)] = \mathrm{Var}[Y(\theta+h)] + \mathrm{Var}[Y(\theta)] \to 2\,\mathrm{Var}[Y(\theta)],$$

under the minor assumption that $\mathrm{Var}[Y(\theta)]$ is continuous in $\theta$.
Case (ii) is the typical consequence of simulating $Y(\theta+h)$ and $Y(\theta)$ using
common random numbers; i.e., generating them from the same sequence
$U_1, U_2, \ldots$ of Unif[0,1] random variables. (In practice, this is accomplished
by starting the simulations at $\theta$ and $\theta + h$ with the same seed for the
random number generator.) For Case (iii) to hold, we generally need not only that
$Y(\theta)$ and $Y(\theta+h)$ use the same random numbers, but also that for (almost)
all values of the random numbers, the output $Y(\cdot)$ is continuous in the input
$\theta$. We will have much more to say about Case (iii) in Section 7.2.2.
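
The three cases can be observed in a small experiment. The sketch below is an illustration only: it assumes a single lognormal step for the underlying asset, takes $\theta$ to be the initial asset price, and uses made-up parameter values and function names. It estimates the variance of $Y(\theta+h) - Y(\theta)$ for a call payoff with independent sampling, and for call and digital payoffs with common random numbers.

```python
import math
import random

K, R, SIGMA, T = 100.0, 0.05, 0.30, 0.25


def terminal_price(s0, z):
    """Lognormal terminal price generated from a standard normal draw z."""
    return s0 * math.exp((R - 0.5 * SIGMA**2) * T + SIGMA * math.sqrt(T) * z)


def call_payoff(s0, z):
    return math.exp(-R * T) * max(terminal_price(s0, z) - K, 0.0)


def digital_payoff(s0, z):
    return math.exp(-R * T) * (1.0 if terminal_price(s0, z) > K else 0.0)


def variance_of_difference(payoff, s0, h, n, common, seed=1):
    """Sample variance of Y(s0 + h) - Y(s0) over n replications."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n):
        z_base = rng.gauss(0.0, 1.0)
        # Common random numbers reuse the same normal draw at both parameter values
        z_bump = z_base if common else rng.gauss(0.0, 1.0)
        diffs.append(payoff(s0 + h, z_bump) - payoff(s0, z_base))
    mean = sum(diffs) / n
    return sum((d - mean) ** 2 for d in diffs) / (n - 1)


for h in (0.4, 0.1):
    print(h,
          variance_of_difference(call_payoff, 100.0, h, 20000, common=False),
          variance_of_difference(call_payoff, 100.0, h, 20000, common=True),
          variance_of_difference(digital_payoff, 100.0, h, 20000, common=True))
```

Under these assumptions, dividing $h$ by four should leave the independent-sampling variance essentially unchanged, reduce the digital variance by a factor of roughly four, and reduce the call variance by a factor of roughly sixteen, in line with Cases (i), (ii), and (iii).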


7.1.2 Optimal Mean Square Error 


Because decreasing $h$ can increase variance while decreasing bias, minimizing
mean square error (MSE) requires balancing these two considerations.
Increasing the number of replications $n$ decreases variance with no effect on
bias, whereas $h$ affects both bias and variance; our objective is to find the
optimal relation between the two. This tradeoff is analyzed by Glynn [153], Fox
and Glynn [128], and Zazanis [358], and in early work by Frolov and Chentsov
[130].

Consider the forward-difference estimator with independent simulation at
$\theta$ and $\theta + h$. We can reasonably assume that Case (i) of (7.8) applies,
and for emphasis we denote this estimator by
$\hat{\Delta}_{F,i} = \hat{\Delta}_{F,i}(n,h)$. Squaring the bias in (7.3) and adding
it to the variance in (7.7), we get

$$\mathrm{MSE}(\hat{\Delta}_{F,i}(n,h)) = O(h^2) + O(n^{-1}h^{-2}),$$

from which we see that minimal conditions for convergence are $h \to 0$ and
$nh^2 \to \infty$.

To derive a more precise conclusion, we strengthen Cases (i) and (ii) of
(7.8). We have four estimators to consider: forward-difference and
central-difference using independent sampling or common random numbers at
different values of $\theta$. We can give a unified treatment of these cases by
considering a generic estimator $\hat{\Delta} = \hat{\Delta}(n,h)$ for which

$$E[\hat{\Delta} - \alpha'(\theta)] = bh^{\beta} + o(h^{\beta}), \qquad
\mathrm{Var}[\hat{\Delta}] = \frac{\sigma^2}{nh^{\eta}} + o(h^{-\eta}), \tag{7.9}$$

for some positive $\beta$, $\eta$, $\sigma$, and some nonzero $b$. Forward- and
central-difference estimators typically have $\beta = 1$ and $\beta = 2$; taking
$\eta = 2$ sharpens Case (i) of (7.8) and taking $\eta = 1$ sharpens Case (ii).
Consider a sequence of estimators $\hat{\Delta}(n, h_n)$ with

$$h_n = h_* n^{-\gamma} \tag{7.10}$$

for some positive $h_*$ and $\gamma$. From our assumptions on bias and variance, we
get

$$\mathrm{MSE}(\hat{\Delta}) = b^2 h_n^{2\beta} + \frac{\sigma^2}{n h_n^{\eta}} \tag{7.11}$$

up to terms that are of higher order in $h_n$. The value of $\gamma$ that maximizes
the rate of decrease of the MSE is $\gamma = 1/(2\beta + \eta)$, from which we draw
the natural implication that with a smaller bias (larger $\beta$), we can use a
larger increment $h$. If we substitute this value of $\gamma$ into (7.11) and take
the square root, we find that

$$\mathrm{RMSE}(\hat{\Delta}) = O\left(n^{-\beta/(2\beta+\eta)}\right),$$

and this is a reasonable measure of the convergence rate of the estimator.
Taking this analysis one step further, we find that

$$n^{2\beta/(2\beta+\eta)}\,\mathrm{MSE}(\hat{\Delta}) \to b^2 h_*^{2\beta} + \sigma^2 h_*^{-\eta};$$

minimizing over $h_*$ yields an optimal value of

$$h_* = \left( \frac{\eta\sigma^2}{2\beta b^2} \right)^{1/(2\beta+\eta)}.$$
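
The rule for choosing the increment is easy to mechanize once rough values of the bias and variance constants are available. The sketch below is a minimal illustration, with hypothetical function names and input values, of the leading-order MSE and of the increment $h_n = h_* n^{-1/(2\beta+\eta)}$ that minimizes it.

```python
def leading_mse(h, n, b, sigma2, beta, eta):
    """Leading terms of the MSE: squared bias plus variance."""
    return (b * h**beta) ** 2 + sigma2 / (n * h**eta)


def optimal_increment(n, b, sigma2, beta, eta):
    """Return (h_n, h_star, gamma) minimizing the leading MSE terms."""
    gamma = 1.0 / (2.0 * beta + eta)
    h_star = (eta * sigma2 / (2.0 * beta * b * b)) ** gamma
    return h_star * n ** (-gamma), h_star, gamma


# Hypothetical inputs: a second derivative of 0.05 gives b = 0.025 for a
# forward difference; common random numbers correspond to eta = 1.
n, b, sigma2 = 10_000, 0.025, 4.0
h_n, h_star, gamma = optimal_increment(n, b, sigma2, beta=1, eta=1)
print("gamma:", gamma)
print("h_star:", h_star, "h_n:", h_n)
print("leading MSE at h_n:", leading_mse(h_n, n, b, sigma2, 1, 1))
```

In practice the constants $b$ and $\sigma^2$ are unknown and would have to be estimated from pilot runs, so the output is best read as a guide to the order of magnitude of a sensible increment.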


The results of this analysis are summarized in Table 7.1 for the forward-
and central-difference estimators using independent sampling or common
random numbers. The variance and bias columns display leading terms only and
abbreviate the more complete expressions in (7.9). The table should be
understood as follows: if the leading terms of the variance and bias are as
indicated in the second and third columns, then the conclusions in the last three
columns hold. The requirement that $b$ not equal zero in (7.9) translates to
$\alpha''(\theta) \ne 0$ and $\alpha'''(\theta) \ne 0$ for the forward and central
estimators, respectively. The results in the table indicate that, at least
asymptotically, $\hat{\Delta}_{C,ii}$ dominates the other three estimators because it
exhibits the fastest convergence.


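Estimator               Variance                   Bias                                Optimal $h_n$    Convergence      $h_*$

$\hat{\Delta}_{F,i}$    $\sigma^2_{F,i}/(nh^2)$    $\tfrac{1}{2}\alpha''(\theta)h$     $O(n^{-1/4})$    $O(n^{-1/4})$    $\left(4\sigma^2_{F,i}/\alpha''(\theta)^2\right)^{1/4}$
$\hat{\Delta}_{C,i}$    $\sigma^2_{C,i}/(nh^2)$    $\tfrac{1}{6}\alpha'''(\theta)h^2$  $O(n^{-1/6})$    $O(n^{-1/3})$    $\left(18\sigma^2_{C,i}/\alpha'''(\theta)^2\right)^{1/6}$
$\hat{\Delta}_{F,ii}$   $\sigma^2_{F,ii}/(nh)$     $\tfrac{1}{2}\alpha''(\theta)h$     $O(n^{-1/3})$    $O(n^{-1/3})$    $\left(2\sigma^2_{F,ii}/\alpha''(\theta)^2\right)^{1/3}$
$\hat{\Delta}_{C,ii}$   $\sigma^2_{C,ii}/(nh)$     $\tfrac{1}{6}\alpha'''(\theta)h^2$  $O(n^{-1/5})$    $O(n^{-2/5})$    $\left(9\sigma^2_{C,ii}/\alpha'''(\theta)^2\right)^{1/5}$

Table 7.1. Convergence rates of finite-difference estimators with optimal increment
$h_n$. The estimators use either forward (F) or central (C) differences and either
independent sampling (i) or common random numbers (ii).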


Glynn [153] proves a central limit theorem for each of the cases in Table 7.1.
These results take the form

$$n^{\beta/(2\beta+\eta)}\left[\hat{\Delta}(n,h_n) - \alpha'(\theta)\right] \Rightarrow N\left( b h_*^{\beta},\ \frac{\sigma^2}{h_*^{\eta}} \right)$$

with $h_n$ as in (7.10) and $\gamma = 1/(2\beta + \eta)$. The limit holds for any
$h_* > 0$; the optimal values of $h_*$ in Table 7.1 minimize the second moment of
the limiting normal random variable.

For the forward-difference estimator with independent samples, the variance
parameter $\sigma^2_{F,i}$ is given by

$$\sigma^2_{F,i} = \lim_{h \to 0}\left( \mathrm{Var}[Y(\theta+h)] + \mathrm{Var}[Y(\theta)] \right) = 2\,\mathrm{Var}[Y(\theta)]$$

under the minimal assumption of continuity of $\mathrm{Var}[Y(\theta)]$. Because the
central-difference estimator has a denominator of $2h$,
$\sigma^2_{C,i} = \mathrm{Var}[Y(\theta)]/2$. The parameters $\sigma^2_{F,ii}$ and
$\sigma^2_{C,ii}$ do not admit a simple description. Their values depend on the joint
distribution of $(Y(\theta-h), Y(\theta), Y(\theta+h))$, which depends in part on the
particular algorithm used for simulation; different algorithms may respond
differently to changes in an input parameter with the random numbers held fixed.
In contrast, $\mathrm{Var}[Y(\theta)]$ is determined by the marginal distribution of
$Y(\theta)$ so all algorithms that legitimately sample from this distribution
produce the same variance. The variances of the finite-difference estimators
appear in the optimal values of $h_*$; though these are unlikely to be known in
advance, they can be estimated from preliminary runs and potentially combined
with rough estimates of the derivatives of $\alpha$ to approximate $h_*$.

Case (iii) of (7.8) is not reflected in Table 7.1. When it applies, the mean
square error takes the form

$$\mathrm{MSE}(\hat{\Delta}) = b^2 h^{2\beta} + \frac{\sigma^2}{n}$$

and there is no tradeoff between bias and variance. We should take $h_n$ as
small as possible, and so long as $n h_n^{2\beta}$ is bounded, the RMSE is
$O(n^{-1/2})$. This dominates all the convergence rates in Table 7.1.

The distinction between Cases (ii) and (iii) is illustrated in Figure 7.2.
The figure compares RMS relative errors of forward-difference estimators of
delta for a standard call option paying $(S(T) - K)^+$ and a digital option
paying $\mathbf{1}\{S(T) > K\}$, with model parameters $K = S(0) = 100$,
$\sigma = 0.30$, $r = 0.05$, and $T = 0.25$. The forward-difference estimators use
common random numbers, which in this example simply means using the same draw
from the normal distribution to generate $S(T)$ from both $S(0)$ and $S(0) + h$.
This example is simple enough to allow exact calculation of the RMSE. To simplify
comparison, in the figure we divide each RMSE by the true value of delta to
get a relative error. The figure shows the effect of varying $h$ with the number
of replications $n$ fixed at 5000.

The standard call option fits in Case (iii) of (7.8); this will be evident from 
the analysis in Section 7.2.2. As expected, its RMS relative error decreases 
with decreasing h. The digital option fits in Case (ii) so its relative error 
explodes as h approaches zero. From the figure we see that the relative error 
for the digital option is minimized at a surprisingly large value of about 4; 
the figure also shows that the cost of taking h too large is much smaller than 
the cost of taking it too small.

Fig. 7.2. RMS relative errors in forward-difference delta estimates for a standard
call and a digital option as a function of the increment $h$, with $n = 5000$.
(Axes: RMS relative error versus size of increment; curves labeled Digital and
Standard Call.)


Extrapolation 


For a smooth function $\alpha(\theta) = E[Y(\theta)]$, the bias in finite-difference
estimates can be reduced through extrapolation, much as in Section 6.2.4. This
technique applies to all the finite-difference estimators considered above; we
illustrate it in the case of $\hat{\Delta}_{C,ii}$, the central-difference estimator
using common random numbers.

A Taylor expansion of $\alpha(\theta)$ shows that

$$\frac{\alpha(\theta+h) - \alpha(\theta-h)}{2h} = \alpha'(\theta) + \frac{1}{6}\alpha'''(\theta)h^2 + O(h^4);$$

odd powers of $h$ are eliminated by the symmetry of the central-difference
estimator. Similarly,

$$\frac{\alpha(\theta+2h) - \alpha(\theta-2h)}{4h} = \alpha'(\theta) + \frac{2}{3}\alpha'''(\theta)h^2 + O(h^4).$$

It follows that the bias in the combined estimator

$$\frac{4}{3}\hat{\Delta}_{C,ii}(n,h) - \frac{1}{3}\hat{\Delta}_{C,ii}(n,2h)$$

is $O(h^4)$. Accordingly, the RMSE of this estimator is $O(n^{-4/9})$ if $h$ is taken
to be $O(n^{-1/9})$. The ninth root of $n$ varies little over practical sample sizes
$n$, so this estimator achieves a convergence rate of nearly $n^{-1/2}$ with $h_n$
nearly constant.
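
The cancellation of the $h^2$ term can be seen without any simulation noise by applying the same combination to an exactly computable price function. The sketch below is an assumption-laden illustration: it uses the Black-Scholes call price as $\alpha$, with $\theta$ the initial asset price, and the helper names and the increment $h = 2$ are chosen only for the example.

```python
import math


def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def d1(s, k=100.0, r=0.05, sigma=0.30, t=0.25):
    return (math.log(s / k) + (r + 0.5 * sigma**2) * t) / (sigma * math.sqrt(t))


def bs_call(s, k=100.0, r=0.05, sigma=0.30, t=0.25):
    """Black-Scholes call price, playing the role of alpha(theta)."""
    x = d1(s, k, r, sigma, t)
    return s * norm_cdf(x) - k * math.exp(-r * t) * norm_cdf(x - sigma * math.sqrt(t))


def bs_delta(s):
    """Exact derivative alpha'(theta)."""
    return norm_cdf(d1(s))


def central(f, x, h):
    return (f(x + h) - f(x - h)) / (2.0 * h)


def extrapolated(f, x, h):
    """(4/3) * central(h) - (1/3) * central(2h); the h^2 bias terms cancel."""
    return (4.0 * central(f, x, h) - central(f, x, 2.0 * h)) / 3.0


s0, h = 100.0, 2.0
exact = bs_delta(s0)
print("central-difference bias:", central(bs_call, s0, h) - exact)
print("extrapolated bias:      ", extrapolated(bs_call, s0, h) - exact)
```

In a simulation, the two central differences would be computed from the same random numbers, and the weights 4/3 and −1/3 would also inflate the variance somewhat, which is why the gain shows up in the rate rather than as a free reduction in error.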


Second Derivatives 


The analysis leading to Table 7.1 extends, with evident modifications, to
finite-difference estimators of second derivatives. Consider a central-difference
estimator of the form

\[
\frac{\bar Y_n(\theta+h) - 2\bar Y_n(\theta) + \bar Y_n(\theta-h)}{h^2}. \tag{7.12}
\]

Its expectation is

\[
\frac{\alpha(\theta+h) - 2\alpha(\theta) + \alpha(\theta-h)}{h^2} = \alpha''(\theta) + O(h^2),
\]

if α is four times differentiable. The bias is then O(h^2).

If values at θ and θ ± h are simulated independently of each other, the
numerator in (7.12) has variance that is O(1) in h, so the variance of the
ratio is O(h^{-4}). By using common random numbers, we can often reduce
the variance of the numerator to O(h), but even in this case the estimator
variance is O(h^{-3}). With h_n chosen optimally, this results in a convergence rate
of O(n^{-2/7}) for the RMSE. This makes precise the idea that estimating second
derivatives is fundamentally more difficult than estimating first derivatives.

For problems in which Case (iii) of (7.8) applies (the most favorable case
for derivative estimation), the estimator in (7.12) often has variance O(h^{-1}).
Because of the nondifferentiability of option payoffs, there are no interesting
examples in which the variance of the numerator of (7.12) is smaller than
O(h^3), and there are plenty of examples in which it is O(h).
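The n^{-2/7} rate quoted above for the common-random-numbers case follows from balancing squared bias against variance. The sketch below spells out that step; the constants A and B are placeholders introduced here for illustration and do not appear in the text.

```latex
% Squared bias O(h^4) plus variance O(1/(n h^3)):
\[
\mathrm{MSE}(h) \approx A h^4 + \frac{B}{n h^3},
\qquad
\frac{d}{dh}\,\mathrm{MSE}(h) = 4A h^3 - \frac{3B}{n h^4} = 0
\;\Longrightarrow\;
h_n = \left(\frac{3B}{4A\,n}\right)^{1/7}.
\]
% Substituting h_n back in:
\[
\mathrm{MSE}(h_n) = O\!\left(n^{-4/7}\right),
\qquad
\mathrm{RMSE}(h_n) = O\!\left(n^{-2/7}\right).
\]
% With estimator variance B/(n h) instead, the same steps give
% h_n \propto n^{-1/5} and RMSE = O(n^{-2/5}).
```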


Multiple Parameters 


Suppose now that θ is a vector of parameters (θ_1, ..., θ_m) and that we need
to estimate the sensitivity of α(θ) = E[Y(θ)] to each of these parameters. A
straightforward approach selects an increment h_i for each θ_i and estimates
∂α/∂θ_i using

\[
\frac{1}{2h_i}\left[\bar Y_n(\theta_1,\ldots,\theta_i+h_i,\ldots,\theta_m) - \bar Y_n(\theta_1,\ldots,\theta_i-h_i,\ldots,\theta_m)\right]
\]

or the corresponding forward-difference estimator. In addition to the problem
of selecting an appropriate h_i, this setting poses a further computational
challenge. It requires estimation of α(θ) at 2m + 1 values of θ using central
differences and m + 1 values using forward differences. This difficulty becomes
even more severe for second derivatives. Finite-difference estimation of all
second derivatives ∂²α/∂θ_i∂θ_j requires simulation at O(m^2) parameter values.
This may be onerous if m is large. In pricing interest rate derivatives, for
example, we may be interested in sensitivities with respect to all initial forward
rates or bond prices, in which case m could easily be 20 or more.

Techniques from the design of experiments and response surface methodology
are potentially useful in reducing the number of parameter values at
which one simulates. For this formulation, let ΔY denote the change in Y
resulting from incrementing each θ_i by h_i, i = 1, ..., m. The problem of
estimating sensitivities can be viewed as one of fitting a first-order model of the
form

\[
\Delta Y = \sum_{i=1}^{m} \beta_i h_i + \epsilon \tag{7.13}
\]

or a second-order model of the form

\[
\Delta Y = \sum_{i=1}^{m} \beta_i h_i + \sum_{i=1}^{m}\sum_{j=i}^{m} \beta_{ij} h_i h_j + \epsilon. \tag{7.14}
\]

Each β_i approximates the partial derivative of α with respect to θ_i, 2β_{ii}
approximates ∂²α/∂θ_i², and each β_{ij}, j ≠ i, approximates ∂²α/∂θ_i∂θ_j. In both
cases, ε represents a residual error.

Given observations of Y at various values of θ, the coefficients β_i, β_{ij} can
be estimated using, e.g., least squares or weighted least squares. Response
surface methodology (as in, e.g., Khuri and Cornell [210]) provides guidance
on choosing the values of θ at which to measure (i.e., simulate) Y.

In the terminology of experimental design, simulating at all 2^m points
defined by adding or subtracting h_i to each θ_i is a full factorial design.
Simulating at fewer points may add bias to estimates of some coefficients. But by
reducing the number of points at which we need to simulate, we can increase
the number of replications at each point and reduce variance. A design with
fewer points thus offers a tradeoff between bias and variance.

Most of the literature on response surface methodology assumes indepen- 
dent residuals across observations. In simulation, we have the flexibility to 
introduce dependence between observations through the assignment of the 
output of the random number generator. Schruben and Margolin [323] ana- 
lyze the design of simulation experiments in which the same random numbers 
are used at multiple parameter values. They recommend a combination of 
common random numbers and antithetic sampling across parameter values. 


7.2 Pathwise Derivative Estimates 


This section and the next develop alternatives to finite-difference methods 
that estimate derivatives directly, without simulating at multiple parameter 
values. They do so by taking advantage of additional information about the 
dynamics and parameter dependence of a simulated process. 


7.2.1 Method and Examples 


In our discussion of optimal mean square error in Section 7.1, we noted that
in Case (iii) of (7.8) the MSE decreases as the parameter increment h decreases.
This suggests that we should let h decrease to zero and estimate the derivative of
α(θ) = E[Y(θ)] using

\[
Y'(\theta) = \lim_{h\to 0}\frac{Y(\theta+h) - Y(\theta)}{h}. \tag{7.15}
\]

This estimator has expectation E[Y'(θ)]. It is an unbiased estimator of α'(θ)
if

\[
E\left[\frac{d}{d\theta}Y(\theta)\right] = \frac{d}{d\theta}E[Y(\theta)], \tag{7.16}
\]

i.e., if the interchange of differentiation and expectation is justified.

Even before discussing the validity of (7.16), we need to clarify what we
mean by (7.15). Up until this point, we have not been explicit about what
dependence, if any, the random variable Y has on the parameter θ. In our
discussion of finite-difference estimators, the notation Y(θ) indicated an
outcome of Y simulated at parameter value θ, but did not entail a functional
relation between the two. Indeed, we noted that the relation between Y(θ)
and Y(θ + h) could depend on whether the two outcomes were simulated using
common random numbers.

To make (7.15) precise, we need a collection of random variables {Y(θ), θ ∈ Θ}
defined on a single probability space (Ω, F, P). In other words, Y(θ) is a
stochastic process indexed by θ ∈ Θ. Take Θ ⊆ R to be an interval. We can
fix ω ∈ Ω and think of the mapping θ ↦ Y(θ, ω) as a random function on
Θ. We can then interpret Y'(θ) = Y'(θ, ω) as the derivative of the random
function with respect to θ with ω held fixed. In (7.15), we implicitly assume
that the derivative exists with probability one, and when this holds, we call
Y'(θ) the pathwise derivative of Y at θ.

As a practical matter, we usually think of each ω as a realization of the
output of an ideal random number generator. Each Y(θ, ω) is then the output
of a simulation algorithm at parameter θ with random number stream ω. Each
Y'(θ, ω) is the derivative of the simulation output with respect to θ with the
random numbers held fixed. The value of this derivative depends, in part, on
how we implement a simulation algorithm.

We will see through examples that the existence with probability 1 of
the pathwise derivative Y'(θ) at each θ typically holds. This is not to say
that, with probability 1, the mapping θ ↦ Y(θ) is a differentiable function
on Θ. The distinction lies in the order of quantification: the exceptional set
of probability 0 on which (7.15) fails to exist can and often does depend on
θ. The union of these exceptional sets for θ ranging over Θ may well have
positive probability.

A large body of work on pathwise derivative estimation exists in the
discrete-event simulation literature, where it is usually called infinitesimal perturbation
analysis. This line of work stems primarily from Ho and Cao [186] and Suri and 
Zazanis [339]; a general framework is developed in Glasserman [138]. Broadie 
and Glasserman [64] apply the method to option pricing and we use some 
of their examples here. Chen and Fu [80] develop applications to mortgage- 
backed securities. 

To illustrate the derivation and scope of pathwise derivative estimators, 


we now consider some examples. 
Example 7.2.1 Black-Scholes delta. The derivative of the Black-Scholes formula
with respect to the initial price S(0) of the underlying asset can be
calculated explicitly and is given by Φ(d) in the notation of (1.44). Calculation of
the Black-Scholes delta does not require simulation, but nevertheless provides
a useful example through which to introduce the pathwise method.

Let

\[
Y = e^{-rT}[S(T) - K]^+
\]

with

\[
S(T) = S(0)\,e^{(r - \frac{1}{2}\sigma^2)T + \sigma\sqrt{T}Z}, \quad Z \sim N(0,1), \tag{7.17}
\]

and take θ to be S(0), with r, σ, T, and K positive constants. Applying the
chain rule for differentiation, we get

\[
\frac{dY}{dS(0)} = \frac{dY}{dS(T)}\,\frac{dS(T)}{dS(0)}. \tag{7.18}
\]

For the first of these two factors, observe that

\[
\frac{d}{dx}\max(0, x - K) =
\begin{cases}
0, & x < K \\
1, & x > K.
\end{cases}
\]

This derivative fails to exist at x = K. But because the event {S(T) = K}
has probability 0, Y is almost surely differentiable with respect to S(T) and
has derivative

\[
\frac{dY}{dS(T)} = e^{-rT}\,\mathbf{1}\{S(T) > K\}. \tag{7.19}
\]

For the second factor in (7.18), observe from (7.17) that S(T) is linear in
S(0) with dS(T)/dS(0) = S(T)/S(0). Combining the two factors in (7.18),
we arrive at the pathwise estimator

\[
\frac{dY}{dS(0)} = e^{-rT}\,\frac{S(T)}{S(0)}\,\mathbf{1}\{S(T) > K\}, \tag{7.20}
\]

which is easily computed in a simulation of S(T). The expected value of this
estimator is indeed the Black-Scholes delta, so the estimator is unbiased.

A minor modification of this derivation produces the pathwise estimator
of the Black-Scholes vega. Replace (7.18) with

\[
\frac{dY}{d\sigma} = \frac{dY}{dS(T)}\,\frac{dS(T)}{d\sigma}.
\]

The first factor is unchanged and the second is easily calculated from (7.17).
Combining the two, we get the pathwise estimator

\[
\frac{dY}{d\sigma} = e^{-rT}\left(-\sigma T + \sqrt{T}Z\right)S(T)\,\mathbf{1}\{S(T) > K\}.
\]

The expected value of this expression is the Black-Scholes vega, so this
estimator is unbiased.

Using (7.17) we can eliminate Z and write the vega estimator as

\[
\frac{dY}{d\sigma} = e^{-rT}\left(\frac{\log(S(T)/S(0)) - (r + \frac{1}{2}\sigma^2)T}{\sigma}\right)S(T)\,\mathbf{1}\{S(T) > K\}.
\]

This formulation has the feature that it does not rely on the particular
expression in (7.17) for simulating S(T). Although derived using (7.17), this
estimator could be applied with any other mechanism for sampling S(T) from
its lognormal distribution. □
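The delta estimator (7.20) and the first form of the vega estimator can be checked against the closed-form Black-Scholes values. The following is a minimal sketch using only the Python standard library; the function names, parameter values, sample size, and seed are assumptions chosen for illustration.

```python
import math
import random
from statistics import NormalDist

def pathwise_delta_vega(s0, k, r, sigma, t, n, seed=0):
    """Average the pathwise delta (7.20) and the pathwise vega over n draws of Z."""
    rng = random.Random(seed)
    disc = math.exp(-r * t)
    sum_delta = sum_vega = 0.0
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)
        st = s0 * math.exp((r - 0.5 * sigma ** 2) * t + sigma * math.sqrt(t) * z)
        if st > k:  # indicator 1{S(T) > K}
            sum_delta += disc * st / s0
            sum_vega += disc * (-sigma * t + math.sqrt(t) * z) * st
    return sum_delta / n, sum_vega / n

def black_scholes_delta_vega(s0, k, r, sigma, t):
    """Closed-form delta and vega of a European call, used only for comparison."""
    d1 = (math.log(s0 / k) + (r + 0.5 * sigma ** 2) * t) / (sigma * math.sqrt(t))
    nd = NormalDist()
    return nd.cdf(d1), s0 * nd.pdf(d1) * math.sqrt(t)

params = dict(s0=100.0, k=100.0, r=0.05, sigma=0.2, t=1.0)
print("pathwise   :", pathwise_delta_vega(n=100_000, **params))
print("closed form:", black_scholes_delta_vega(**params))
```

Both estimates reuse the draw of S(T) that a price simulation would generate anyway, which is the source of the method's low cost.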


This example illustrates a general point about differentiability made at the
beginning of this section. Consider the discounted payoff Y as a function of
S(0). If we fix any S(0) > 0, then Y is differentiable at S(0) with probability
1, because for each S(0) the event {S(T) ≠ K} has probability 1. However,
the probability that Y is a differentiable function of S(0) throughout (0, ∞) is
zero: for each value of Z, there is some S(0) at which Y fails to be differentiable,
namely the S(0) that makes S(T) = K for the given value of Z.


Example 7.2.2 Path-dependent deltas. As in the previous example, suppose
the underlying asset is modeled by geometric Brownian motion, but now let
the payoff be path-dependent. For example, consider an Asian option

\[
Y = e^{-rT}[\bar S - K]^+, \qquad \bar S = \frac{1}{m}\sum_{i=1}^{m} S(t_i),
\]

for some fixed dates 0 < t_1 < ... < t_m ≤ T. Much as in Example 7.2.1,

\[
\frac{dY}{dS(0)} = \frac{dY}{d\bar S}\,\frac{d\bar S}{dS(0)} = e^{-rT}\,\mathbf{1}\{\bar S > K\}\,\frac{d\bar S}{dS(0)}.
\]

Also,

\[
\frac{d\bar S}{dS(0)} = \frac{1}{m}\sum_{i=1}^{m}\frac{dS(t_i)}{dS(0)} = \frac{1}{m}\sum_{i=1}^{m}\frac{S(t_i)}{S(0)} = \frac{\bar S}{S(0)}.
\]

The pathwise estimator of the option delta is

\[
\frac{dY}{dS(0)} = e^{-rT}\,\mathbf{1}\{\bar S > K\}\,\frac{\bar S}{S(0)}.
\]

This estimator is in fact unbiased; this follows from a more general result in
Section 7.2.2. Because there is no formula for the price of an Asian option, this
estimator has genuine practical value. Because S̄ would be simulated anyway
in estimating the price of the option, this estimator requires negligible
additional effort. Compared with a finite-difference estimator, it reduces variance,
eliminates bias, and cuts the computing time roughly in half.
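The Asian delta estimator can be compared with a central-difference estimate computed on the same paths. The sketch below assumes equally spaced averaging dates with the last date equal to the expiry; the function names, the increment h, and all parameter values are hypothetical choices for illustration.

```python
import math
import random

def average_price(s0, z, r, sigma, dt):
    """S-bar: arithmetic average of a GBM path at len(z) equally spaced dates."""
    s, total = s0, 0.0
    for zi in z:
        s *= math.exp((r - 0.5 * sigma ** 2) * dt + sigma * math.sqrt(dt) * zi)
        total += s
    return total / len(z)

def asian_delta_estimates(s0, k, r, sigma, dt, m, n, h=0.01, seed=0):
    """Return (pathwise delta, central-difference delta) using common random numbers."""
    rng = random.Random(seed)
    disc = math.exp(-r * dt * m)  # assumes the last averaging date is the expiry T
    pathwise = finite_diff = 0.0
    for _ in range(n):
        z = [rng.gauss(0.0, 1.0) for _ in range(m)]
        s_bar = average_price(s0, z, r, sigma, dt)
        if s_bar > k:  # pathwise estimator: e^{-rT} 1{S-bar > K} S-bar / S(0)
            pathwise += disc * s_bar / s0
        up = disc * max(average_price(s0 + h, z, r, sigma, dt) - k, 0.0)
        down = disc * max(average_price(s0 - h, z, r, sigma, dt) - k, 0.0)
        finite_diff += (up - down) / (2.0 * h)
    return pathwise / n, finite_diff / n

pw, fd = asian_delta_estimates(s0=100.0, k=100.0, r=0.05, sigma=0.2,
                               dt=1.0 / 12, m=12, n=20_000)
print(f"pathwise delta = {pw:.4f}, central-difference delta = {fd:.4f}")
```

The pathwise branch needs one path per replication, while the finite-difference branch needs two additional perturbed paths.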


Similar comments apply in the case of a lookback put with discounted
payoff

\[
Y = e^{-rT}\left(\max_{1\le i\le m} S(t_i) - S(t_m)\right).
\]

The pathwise estimator of delta simplifies to Y/S(0), which is also unbiased.

Using similar arguments, we could also derive pathwise estimators of delta
and other derivatives for barrier options. However, for reasons discussed in
Section 7.2.2, these are not in general unbiased: a small change in S(0) can
result in a large change in Y. □

Example 7.2.3 Path-dependent vega. Now consider the sensitivity of the
Asian option in the previous example to the volatility σ. Let

\[
S(t_i) = S(t_{i-1})\,e^{(r - \frac{1}{2}\sigma^2)[t_i - t_{i-1}] + \sigma\sqrt{t_i - t_{i-1}}\,Z_i}, \quad i = 1,\ldots,m, \tag{7.21}
\]

with Z_1, ..., Z_m independent N(0,1) random variables. The parameter σ
affects S(t_i) explicitly through this functional relation but also implicitly
through the dependence of S(t_{i-1}) on σ. By differentiating both sides we
get a recursion for the derivatives along the path:

\[
\frac{dS(t_i)}{d\sigma} = \frac{dS(t_{i-1})}{d\sigma}\,\frac{S(t_i)}{S(t_{i-1})} + S(t_i)\left[-\sigma(t_i - t_{i-1}) + \sqrt{t_i - t_{i-1}}\,Z_i\right].
\]

With initial condition dS(0)/dσ = 0, this recursion is solved by

\[
\frac{dS(t_i)}{d\sigma} = S(t_i)\sum_{j=1}^{i}\left[-\sigma(t_j - t_{j-1}) + \sqrt{t_j - t_{j-1}}\,Z_j\right],
\]

which can also be written as

\[
\frac{dS(t_i)}{d\sigma} = S(t_i)\left[\log(S(t_i)/S(0)) - (r + \tfrac{1}{2}\sigma^2)t_i\right]/\sigma.
\]

The pathwise estimator of the Asian option vega is

\[
e^{-rT}\,\frac{1}{m}\sum_{i=1}^{m}\frac{dS(t_i)}{d\sigma}\,\mathbf{1}\{\bar S > K\}.
\]

This example illustrates a general technique for deriving pathwise estimators
based on differentiating both sides of a recursive expression for the evolution
of an underlying asset. This idea is further developed in Section 7.2.3. □
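The recursion for dS(t_i)/dσ and its closed-form solution can be checked against each other on a single simulated path. This is a minimal sketch; the function names, the date grid, and the parameter values are assumptions for illustration, and the discount factor takes T equal to the last averaging date.

```python
import math
import random

def path_and_sigma_derivs(s0, r, sigma, times, z):
    """Simulate S(t_i) via (7.21) and propagate dS(t_i)/dsigma by the recursion."""
    s_prev, ds_prev, t_prev = s0, 0.0, 0.0  # initial condition dS(0)/dsigma = 0
    path, derivs = [], []
    for t, zi in zip(times, z):
        dt = t - t_prev
        s = s_prev * math.exp((r - 0.5 * sigma ** 2) * dt + sigma * math.sqrt(dt) * zi)
        ds = ds_prev * s / s_prev + s * (-sigma * dt + math.sqrt(dt) * zi)
        path.append(s)
        derivs.append(ds)
        s_prev, ds_prev, t_prev = s, ds, t
    return path, derivs

def closed_form_derivs(s0, r, sigma, times, path):
    """Equivalent form S(t_i)[log(S(t_i)/S(0)) - (r + sigma^2/2) t_i] / sigma."""
    return [s * (math.log(s / s0) - (r + 0.5 * sigma ** 2) * t) / sigma
            for s, t in zip(path, times)]

def asian_vega_one_path(s0, k, r, sigma, times, z):
    """Pathwise Asian vega on one path, assuming T equals the last date in times."""
    path, derivs = path_and_sigma_derivs(s0, r, sigma, times, z)
    s_bar = sum(path) / len(path)
    disc = math.exp(-r * times[-1])
    return disc * sum(derivs) / len(derivs) if s_bar > k else 0.0

rng = random.Random(0)
times = [i / 4 for i in range(1, 5)]  # hypothetical quarterly dates
z = [rng.gauss(0.0, 1.0) for _ in times]
path, derivs = path_and_sigma_derivs(100.0, 0.05, 0.2, times, z)
print(derivs)
print(closed_form_derivs(100.0, 0.05, 0.2, times, path))
```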


Example 7.2.4 Options on multiple assets. Suppose the assets S_1, ..., S_d are
modeled by a multivariate geometric Brownian motion GBM(r, Σ) as defined
in Section 3.2.3. Their values can be generated by setting

\[
S_i(T) = S_i(0)\,e^{(r - \frac{1}{2}\sigma_i^2)T + \sqrt{T}X_i}, \quad i = 1,\ldots,d,
\]

with (X_1, ..., X_d) drawn from N(0, Σ). It follows that in this construction
dS_i(T)/dS_i(0) = S_i(T)/S_i(0), just as in the univariate case, and also
dS_i(T)/dS_j(0) = 0, for j ≠ i.

A spread option with discounted payoff

\[
Y = e^{-rT}[(S_2(T) - S_1(T)) - K]^+
\]

has a delta with respect to each underlying asset. Their pathwise estimators
are

\[
\frac{\partial Y}{\partial S_2(0)} = e^{-rT}\,\frac{S_2(T)}{S_2(0)}\,\mathbf{1}\{S_2(T) - S_1(T) > K\}
\]

and

\[
\frac{\partial Y}{\partial S_1(0)} = -e^{-rT}\,\frac{S_1(T)}{S_1(0)}\,\mathbf{1}\{S_2(T) - S_1(T) > K\}.
\]

An option on the maximum of d assets with discounted payoff

\[
Y = e^{-rT}[\max\{S_1(T),\ldots,S_d(T)\} - K]^+
\]

has a delta with respect to each underlying asset. A small change in S_i(T)
affects Y only if the ith asset attained the maximum and exceeded the strike,
so

\[
\frac{\partial Y}{\partial S_i(T)} = e^{-rT}\,\mathbf{1}\{S_i(T) > \max_{j\ne i} S_j(T),\; S_i(T) > K\}.
\]

Multiplying this expression by S_i(T)/S_i(0) yields the pathwise estimator for
the ith delta. □


We introduced these examples in the setting of geometric Brownian motion 
for simplicity, but the expressions derived in the examples apply much more 
generally. Consider an underlying asset S(t) described by an SDE

\[
\frac{dS(t)}{S(t)} = \mu(t)\,dt + \sigma(t)\,dW(t),
\]

in which μ(t) and σ(t) could be stochastic but have no dependence on S(0).
The asset price at T is

\[
S(T) = S(0)\exp\left(\int_0^T \left(\mu(t) - \tfrac{1}{2}\sigma^2(t)\right)dt + \int_0^T \sigma(t)\,dW(t)\right)
\]

and we still have dS(T)/dS(0) = S(T)/S(0). Indeed, this expression for the
derivative is valid whenever S(t) is given by S(0) exp(X(t)) for some process
X that does not depend on S(0).


Example 7.2.5 Square-root diffusion. As an example of a process that is not
linear in its initial state, consider a square-root diffusion

\[
dX(t) = a(b - X(t))\,dt + \sigma\sqrt{X(t)}\,dW(t).
\]

From Section 3.4 (and (3.68) in particular) we know that X(t) has the
distribution of a multiple of a noncentral chi-square random variable with a
noncentrality parameter proportional to X(0):

\[
X(t) \sim c_1\,\chi'^2_{\nu}(c_2 X(0)).
\]

See (3.68)-(3.70) for explicit expressions for c_1, c_2, and ν. As explained in
Section 3.4, if ν > 1 we can simulate X(t) using

\[
X(t) = c_1\left(\left(Z + \sqrt{c_2 X(0)}\right)^2 + \chi^2_{\nu-1}\right), \tag{7.22}
\]

with Z ~ N(0,1) and χ²_{ν-1} an ordinary chi-square random variable having
ν - 1 degrees of freedom, independent of Z. It follows that

\[
\frac{dX(t)}{dX(0)} = c_1 c_2\left(1 + \frac{Z}{\sqrt{c_2 X(0)}}\right).
\]

This generalizes to a path simulated at dates t_1 < t_2 < ... through the
recursion

\[
\frac{dX(t_{i+1})}{dX(0)} = c_1 c_2\left(1 + \frac{Z_{i+1}}{\sqrt{c_2 X(t_i)}}\right)\frac{dX(t_i)}{dX(0)},
\]

with Z_{i+1} used to generate X(t_{i+1}) from X(t_i). The coefficients c_1, c_2 depend
on the time increment t_{i+1} - t_i, as in (3.68)-(3.70). These expressions can be
applied to the Cox-Ingersoll-Ross [91] interest rate model and to the Heston
[179] stochastic volatility model in (3.64)-(3.65). □
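The derivative obtained from (7.22) can be checked against a finite difference in X(0) with Z and the chi-square draw held fixed. In this minimal sketch the coefficients c1 and c2 and the draws are passed in as plain numbers; the specific values are hypothetical and are not computed from (3.68)-(3.70).

```python
import math

def sqrt_diffusion_step(x0, z, chi2, c1, c2):
    """One transition via (7.22): X(t) = c1 * ((z + sqrt(c2 * x0))^2 + chi2)."""
    return c1 * ((z + math.sqrt(c2 * x0)) ** 2 + chi2)

def sqrt_diffusion_derivative(x0, z, c1, c2):
    """Pathwise derivative dX(t)/dX(0) = c1 * c2 * (1 + z / sqrt(c2 * x0))."""
    return c1 * c2 * (1.0 + z / math.sqrt(c2 * x0))

# Hypothetical inputs: a normal draw z, a chi-square draw chi2, and
# illustrative coefficients c1, c2.
x0, z, chi2, c1, c2 = 0.04, 0.3, 1.7, 0.01, 50.0
h = 1e-6
finite_diff = (sqrt_diffusion_step(x0 + h, z, chi2, c1, c2)
               - sqrt_diffusion_step(x0 - h, z, chi2, c1, c2)) / (2.0 * h)
print(sqrt_diffusion_derivative(x0, z, c1, c2), finite_diff)
```

The chi-square term drops out of the derivative because it does not depend on X(0) in this construction.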


Example 7.2.6 Digital options and gamma. Consider a digital option with
discounted payoff

\[
Y = e^{-rT}\,\mathbf{1}\{S(T) > K\},
\]

and for concreteness let S be modeled by geometric Brownian motion. Viewed
as a function of S(T), the discounted payoff Y is differentiable except at
S(T) = K, which implies that Y is differentiable with probability 1. But
because Y is piecewise constant in S(T), this derivative is 0 wherever it exists,
and the same is true of Y viewed as a function of S(0). Thus,

\[
0 = E\left[\frac{dY}{dS(0)}\right] \ne \frac{d}{dS(0)}E[Y].
\]

This is an example in which the pathwise derivative exists with probability
1 but is entirely uninformative. The change in E[Y] with a change in S(0)
is driven by the possibility that a change in S(0) will cause S(T) to cross
the strike K, but this possibility is missed by the pathwise derivative. The
pathwise derivative sees only the local insensitivity of Y to S(0). This also
explains the qualitatively different behavior as h → 0 for the standard call
and the digital option in Figure 7.2.

For much the same reason, the pathwise method is generally inapplicable to
barrier options. On a fixed path, a sufficiently small change in the underlying
asset will neither create nor eliminate a barrier crossing, so the effect of the
barrier is missed by the pathwise derivative.

The example of a digital option also indicates that the pathwise method,
at least in its simplest form, is generally inapplicable to estimating second
derivatives. If, for example, Y is the discounted payoff of an ordinary call
option, then its first derivative (in S(T)) has exactly the form of a digital payoff;
see (7.19). The pathwise method faces the same difficulty in estimating the
gamma of a standard call option as it does in estimating the delta of a digital
option. □


7.2.2 Conditions for Unbiasedness 


Example 7.2.6 points to a limitation in the scope of the pathwise method: it 
is generally inapplicable to discontinuous payoffs. The possible failure of the 
interchange of derivative and expectation in (7.16) is a practical as well as 
theoretical issue. We therefore turn now to a discussion of conditions ensuring 
the validity of (7.16) and thus the unbiasedness of the pathwise method. 

We continue to use θ to denote a generic parameter and assume the
existence of a random function {Y(θ), θ ∈ Θ} with Θ an interval of the real
line. As explained at the beginning of Section 7.2.1, this random function
represents the output of a simulation algorithm as a function of θ with the
simulation's random numbers held fixed. We consider settings in which the
derivative Y'(θ) exists with probability 1 at each θ ∈ Θ. As the examples of
Section 7.2.1 should make evident, this is by no means restrictive.

Given the existence of Y'(θ), the most important question is the validity
of (7.16), which is a matter of interchanging a limit and an expectation to
ensure

\[
E\left[\lim_{h\to 0}\frac{Y(\theta+h) - Y(\theta)}{h}\right] = \lim_{h\to 0} E\left[\frac{Y(\theta+h) - Y(\theta)}{h}\right].
\]

A necessary and sufficient condition for this is uniform integrability (see
Appendix A) of the difference quotients h^{-1}[Y(θ + h) - Y(θ)]. Our objective is
to provide sufficient conditions that are more readily verified in practice.

Inspection of the examples in Section 7.2.1 reveals a pattern in the deriva- 
tion of the estimators: we apply the chain rule to express Y’(@) as a product 
of terms, the first relating the discounted payoff to the path of the underlying 
asset, the second relating the path to the parameter. Because this is a natural 
and convenient way to derive estimators, we formulate conditions to fit this 
approach, following Broadie and Glasserman [64]. 

We restrict attention to discounted payoffs that depend on the value of
an underlying asset or assets at a finite number of fixed dates. Rather than
distinguish between multiple assets and values of a single asset at multiple
dates, we simply suppose that the discounted payoff is a function of a random
vector X(θ) = (X_1(θ), ..., X_m(θ)), which is itself a function of the parameter
θ. Thus,

\[
Y(\theta) = f(X_1(\theta),\ldots,X_m(\theta)),
\]

for some function f : R^m → R depending on the specific derivative security.
We require

(A1) At each θ ∈ Θ, X_i'(θ) exists with probability 1, for all i = 1, ..., m.

If f is differentiable, then Y inherits differentiability from X under
condition (A1). As illustrated in the examples of Section 7.2.1, option payoffs often
fail to be differentiable, but the points at which differentiability fails can often
be ignored because they occur with probability 0. To make this precise, we
let D_f ⊆ R^m denote the set of points at which f is differentiable and require

(A2) P(X(θ) ∈ D_f) = 1 for all θ ∈ Θ.

This then implies that Y'(θ) exists with probability 1 and is given by

\[
Y'(\theta) = \sum_{i=1}^{m}\frac{\partial f}{\partial x_i}(X(\theta))\,X_i'(\theta).
\]


Comparison of Examples 7.2.1 and 7.2.6 indicates that even if points of 
nondifferentiability of f occur with probability 0, the behavior of f at these 
points is important. For both the standard call and the digital option, differ- 
entiability fails on the zero-probability event {S(T) = K}; but whereas the 
payoff of the call is continuous as S(T) crosses the strike, the payoff of the 
digital option is not. This distinction leads to an unbiased estimator in the 
first case and not the second. In fact we need a bit more than continuity; a
convenient condition is

(A3) There exists a constant k_f such that for all x, y ∈ R^m,

\[
|f(x) - f(y)| \le k_f\,\|x - y\|;
\]

i.e., f is Lipschitz.

The standard call, Asian option, lookback, spread option, and max option
in Section 7.2.1 all satisfy this condition (as does any composition of linear
transformations and the functions min and max); the digital payoff does not.
We impose a related condition on X(θ):

(A4) There exist random variables κ_i, i = 1, ..., m, such that for all θ_1, θ_2 ∈ Θ,

\[
|X_i(\theta_2) - X_i(\theta_1)| \le \kappa_i\,|\theta_2 - \theta_1|,
\]

and E[κ_i] < ∞, i = 1, ..., m.
Conditions (A3) and (A4) together imply that Y is almost surely Lipschitz
in θ because the Lipschitz property is preserved by composition. Thus,

\[
|Y(\theta_2) - Y(\theta_1)| \le \kappa_Y\,|\theta_2 - \theta_1|,
\]

and for κ_Y we may take

\[
\kappa_Y = k_f\sum_{i=1}^{m}\kappa_i.
\]

It follows that E[κ_Y] < ∞. Observing that

\[
\left|\frac{Y(\theta+h) - Y(\theta)}{h}\right| \le \kappa_Y,
\]

we can now apply the dominated convergence theorem to interchange
expectation and the limit as h → 0 and conclude that dE[Y(θ)]/dθ exists and equals
E[Y'(θ)]. In short, conditions (A1)-(A4) suffice to ensure that the pathwise
derivative is an unbiased estimator.

Condition (A4) is satisfied in all the examples of Section 7.2.1, at least if
Θ is chosen appropriately. No restrictions are needed when the dependence
on the parameter is linear as in, e.g., the mapping from S(0) to S(T) for
geometric Brownian motion. In the case of the mapping from σ to S(T), (A4)
holds if we take Θ to be a bounded interval; this is harmless because for the
purposes of estimating a derivative at σ we may take an arbitrarily small
neighborhood of σ. In Example 7.2.5, for the mapping from X(0) to X(t) to
be Lipschitz, we need to restrict X(0) to a set of values bounded away from
zero.

If conditions (A1)-(A4) hold and we strengthen (A4) slightly to require
E[κ_i^2] < ∞, then E[κ_Y^2] < ∞ and

\[
E[(Y(\theta+h) - Y(\theta))^2] \le E[\kappa_Y^2]\,h^2,
\]

from which it follows that Var[Y(θ + h) - Y(θ)] = O(h^2), as in Case (iii) of
(7.8). Conversely, if Case (iii) holds, if Y'(θ) exists with probability 1, and
E[Y(θ)] is differentiable, then

\[
E[(Y(\theta+h) - Y(\theta))^2] = \mathrm{Var}[Y(\theta+h) - Y(\theta)] + \left(E[Y(\theta+h) - Y(\theta)]\right)^2 = O(h^2).
\]

Hence,

\[
E\left[\left(\frac{Y(\theta+h) - Y(\theta)}{h}\right)^2\right]
\]

remains bounded as h → 0 and [Y(θ + h) - Y(θ)]/h is uniformly integrable.
Thus, the scope of the pathwise method is essentially the same as the scope
of Case (iii) in (7.8).

Of conditions (A1)-(A4), the one that poses a practical limitation is (A3). 
As we noted previously, the payoffs of digital options and barrier options are 
not even continuous. Indeed, discontinuities in a payoff are the main obstacle
to the applicability of the pathwise method. A simple rule of thumb states that 
the pathwise method applies when the payoff is continuous in the parameter 
of interest. Though this is not a precise guarantee (of the type provided by 
the conditions above), it provides sound guidance in most practical problems. 


7.2.3 Approximations and Related Methods 


This section develops further topics in sensitivity estimation using pathwise 
derivatives. Through various approximations and extensions, we can improve 
the efficiency or expand the scope of the method. 


General Diffusion Processes 
Consider a process X described by a stochastic differential equation

\[
dX(t) = a(X(t))\,dt + b(X(t))\,dW(t) \tag{7.23}
\]

with fixed initial condition X(0). Suppose, for now, that this is a scalar process
driven by a scalar Brownian motion. We might simulate the process using an
Euler scheme with step size h; using the notation of Chapter 6, we write the
discretization as

\[
\hat X(i+1) = \hat X(i) + a(\hat X(i))h + b(\hat X(i))\sqrt{h}\,Z_{i+1}, \quad \hat X(0) = X(0),
\]

with X̂(i) denoting the discretized approximation to X(ih) and Z_1, Z_2, ...
denoting independent N(0,1) random variables. Let

\[
\hat\Delta(i) = \frac{d\hat X(i)}{dX(0)}.
\]

By differentiating both sides of the Euler scheme we get the recursion

\[
\hat\Delta(i+1) = \hat\Delta(i) + a'(\hat X(i))\hat\Delta(i)h + b'(\hat X(i))\hat\Delta(i)\sqrt{h}\,Z_{i+1}, \quad \hat\Delta(0) = 1, \tag{7.24}
\]

with a' and b' denoting derivatives of the coefficient functions. With

\[
f(\hat X(1),\ldots,\hat X(m))
\]

denoting the discounted payoff of some option, the pathwise estimator of the
option's delta is

\[
\sum_{i=1}^{m}\frac{\partial f}{\partial x_i}(\hat X(1),\ldots,\hat X(m))\,\hat\Delta(i).
\]

A similar estimator could be derived by differentiating both sides of a
higher-order discretization of X.
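The recursion (7.24) can be run alongside the Euler scheme with a few extra operations per step. The following is a minimal sketch for the scalar case; the function name and the coefficient functions passed to it are hypothetical, and the caller supplies a, b and their derivatives as plain Python callables.

```python
import math
import random

def euler_with_delta(x0, a, b, da, db, h, steps, rng):
    """Euler path X-hat(i) together with Delta-hat(i) = dX-hat(i)/dX(0) from (7.24)."""
    x, delta = x0, 1.0  # Delta-hat(0) = 1
    xs, deltas = [], []
    for _ in range(steps):
        sqrt_h_z = math.sqrt(h) * rng.gauss(0.0, 1.0)
        # Update delta before x so that a' and b' are evaluated at the current state.
        delta = delta + da(x) * delta * h + db(x) * delta * sqrt_h_z
        x = x + a(x) * h + b(x) * sqrt_h_z
        xs.append(x)
        deltas.append(delta)
    return xs, deltas

# Illustrative linear coefficients a(x) = 0.05 x and b(x) = 0.2 x, for which
# the recursion reproduces X-hat(i) / X(0) at every step.
xs, deltas = euler_with_delta(2.0, lambda x: 0.05 * x, lambda x: 0.2 * x,
                              lambda x: 0.05, lambda x: 0.2,
                              h=0.01, steps=100, rng=random.Random(0))
print(deltas[-1], xs[-1] / 2.0)
```

For a payoff f of the path, the delta estimate is the sum of the partial derivatives of f with respect to each X̂(i), weighted by the stored Δ̂(i).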
Under relatively minor conditions on the coefficient functions a and b, the
(continuous-time) solution X(t) to the SDE (7.23) is almost surely
differentiable in X(0). Moreover, its derivative Δ(t) = dX(t)/dX(0) satisfies

\[
d\Delta(t) = a'(X(t))\Delta(t)\,dt + b'(X(t))\Delta(t)\,dW(t), \quad \Delta(0) = 1. \tag{7.25}
\]

See Section 4.7 of Kunita [217] or Section 5.7 of Protter [300] for precise
results of this type. It follows that (7.24), derived by differentiating the
Euler scheme for X, is also the Euler scheme for Δ. Similar results apply
to derivatives of X(t) with respect to parameters of the coefficient functions
a and b.

In light of these results, there is no real theoretical obstacle to developing
pathwise estimators for general diffusions; through (7.25), the problem reduces
to one of discretizing stochastic differential equations. There may, however, be
practical obstacles to this approach. The additional effort required to simulate
(7.25) may be roughly the same as the effort required to simulate a second
path of X from a different initial state. (Contrast this with examples in
Section 7.2.1 for which dS(T)/dS(0) reduces to S(T)/S(0) and is thus available
at no additional cost.) Also, in the case of a d-dimensional diffusion X, we
need to replace (7.25) with a d × d system of equations for

\[
\Delta_{ij}(t) = \frac{\partial X_i(t)}{\partial X_j(0)}, \quad i, j = 1,\ldots,d.
\]

Simulating a system of SDEs for this matrix of derivatives may be quite
time-consuming.
These considerations motivate the use of approximations in simulating the
derivatives of X(t). One strategy is to use coarser time steps in simulating the
derivatives than in simulating X. There are many ways of implementing this
strategy; we discuss just one and for simplicity we consider only the scalar
case.

Freezing a'(X(t)) and b'(X(t)) at their time-zero values transforms (7.24)
into

\[
\hat\Delta(i) = \hat\Delta(i-1)\left(1 + a'(X(0))h + b'(X(0))\sqrt{h}\,Z_i\right).
\]

This is potentially much faster to simulate than (7.24), especially if evaluation
of a' and b' is time-consuming. To improve the accuracy of the approximation,
we might update the coefficients after every k steps. In this case, we would
use the coefficients a'(X(0)) and b'(X(0)) for i = 1, ..., k, then use

\[
\hat\Delta(i) = \hat\Delta(i-1)\left(1 + a'(\hat X(k))h + b'(\hat X(k))\sqrt{h}\,Z_i\right)
\]

for i = k+1, ..., 2k, and so on.

If, for purposes of differentiation, we entirely ignored the dependence of
a and b on the current state X, we would get simply Δ̂(i) = 1. For S(t) =
S(0) exp(X(t)), this is equivalent to making the approximation dS(t)/dS(0) ≈
S(t)/S(0), which we know is exact in several cases.
LIBOR Market Model


Similar approximations are derived and tested in the setting of LIBOR
market models by Glasserman and Zhao [150]. This application has considerable
practical value and also serves to illustrate the approach. We use the notation
of Section 3.7.

Consider, then, a system of SDEs of the form

\[
\frac{dL_n(t)}{L_n(t)} = \mu_n(L(t),t)\,dt + \sigma_n(t)^\top dW(t), \quad n = 1,\ldots,M,
\]

with L(t) denoting the M-vector of all rates, W a d-dimensional Brownian
motion, and each σ_n an R^d-valued deterministic function of time. We get the
spot measure dynamics (3.112) by setting

\[
\mu_n(L(t),t) = \sum_{j=\eta(t)}^{n}\frac{\delta_j L_j(t)\,\sigma_j(t)^\top\sigma_n(t)}{1 + \delta_j L_j(t)},
\]

where, as in Section 3.7, η(t) denotes the index of the next maturity date as
of time t. Let

\[
\Delta_{nk}(t) = \frac{\partial L_n(t)}{\partial L_k(0)}
\]

and write Δ̂_{nk} for a discrete approximation to this process. From the spot
measure dynamics it is evident that Δ_{nk}(t) = 0 unless k ≤ n. Also,
Δ_{nk}(0) = 1{n = k}.

Suppose we simulate the forward rates L_n using a log-Euler scheme as in
(3.120), with fixed time step h. By differentiating both sides, we get

\[
\hat\Delta_{nk}(i+1) = \hat\Delta_{nk}(i)\,\frac{\hat L_n(i+1)}{\hat L_n(i)} + \hat L_n(i+1)\sum_{j=1}^{n}\frac{\partial\hat\mu_n(i)}{\partial\hat L_j(i)}\,\hat\Delta_{jk}(i)\,h, \tag{7.26}
\]

with

\[
\hat\mu_n(i) = \sum_{j=\eta(ih)}^{n}\frac{\delta_j\hat L_j(i)\,\sigma_j(ih)^\top\sigma_n(ih)}{1 + \delta_j\hat L_j(i)}.
\]

Simulating all O(M^2) derivatives using (7.26) is clearly quite time-consuming,
and the main source of difficulty comes from the form of the drift.

If we replaced the μ_n(L(t), t) with

\[
\mu_n^o(t) = \mu_n(L(0), t), \tag{7.27}
\]

we would make the drift time-varying but deterministic. (The deterministic
dependence on t enters through η and the volatility functions σ_j.) We used
this approximation in (4.8) to construct a control variate. Here we preserve
the true drift μ_n in the dynamics of L_n but differentiate as though the drift
were μ_n^o; i.e., as though the forward rates inside the drift function were frozen
at their initial values. Differentiating the approximation

\[
\hat L_n(i) \approx \hat L_n(0)\exp\left(\sum_{\ell=0}^{i-1}\left[\mu_n^o(\ell h) - \tfrac{1}{2}\|\sigma_n(\ell h)\|^2\right]h + \sqrt{h}\,\sigma_n(\ell h)^\top Z_{\ell+1}\right)
\]

yields

\[
\hat\Delta_{nk}(i) = \frac{\hat L_n(i)}{L_k(0)}\,\mathbf{1}\{n = k\} + \hat L_n(i)\sum_{\ell=0}^{i-1}\frac{\partial\mu_n^o(\ell h)}{\partial L_k(0)}\,h. \tag{7.28}
\]

The derivatives of μ^o,

\[
\frac{\partial\mu_n^o(\ell h)}{\partial L_k(0)} = \frac{\delta_k\,\sigma_k(\ell h)^\top\sigma_n(\ell h)}{(1 + \delta_k L_k(0))^2}\,\mathbf{1}\{\eta(\ell h) \le k \le n\},
\]

are deterministic functions of time and can therefore be computed once,
stored, and applied to each replication. Consequently, the computational cost
of (7.28) is modest, especially when compared with the exact derivative
recursion (7.26) ("exact" here meaning exact for the Euler scheme).

Glasserman and Zhao [150] report numerical results using these methods
to estimate deltas of caplets. They find that the approximation (7.28)
produces estimates quite close to the exact (for Euler) recursion (7.26), with
some degradation in the approximation at longer maturities. Both methods
are subject to discretization error but the discretization bias is small, less
than 0.2% of the continuous-time delta obtained by differentiating the Black
formula (3.117) for the continuous-time caplet price.



Smoothing 


We noted in Section 7.2.2 that discontinuities in a discounted payoff generally 
make the pathwise method inapplicable. In some cases, this obstacle can be 
overcome by smoothing discontinuities through conditional expectations. We 
illustrate this idea with two examples. 

Our first example is the discounted digital payoff Y = e~"71{S(T) > K}, 
which also arises in the derivative of a standard call option; see Example 7.2.6. 
If S ~ GBM(u, 07) and we fix 0 < e < T, then 


EYST- o] = ea (EECA) tUr ark), 


Differentiation yields 


dE[Y|S(T—6)] | eae ( log(S(T' — €)/K) + (u — $07)e | 
dS(T-e«) S(T-—eave | 
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and then, through the chain rule, the unbiased delta estimator 


dE[Y|S(T — €)] ert r es — €)/K) + (u — aee) 


dS(0) ss S(O)av/e T/E 


This calculation illustrates that by conditioning on the underlying asset $\epsilon$
time units before expiration, we can smooth the discontinuity in the digital
payoff and then, by differentiating, arrive at an unbiased derivative estimator.
Of course, the same argument allows exact calculation of the derivative of
$E[Y]$ in this example. The value of this derivation lies not so much in
providing an exact expression for geometric Brownian motion as in providing an
approximation for more general processes. Equation (7.29) could, for example,
be applied to a stochastic volatility model by further conditioning on the
volatility at $T - \epsilon$ and using this value in place of $\sigma$. This expression
should then be multiplied by an exact or approximate expression for
$dS(T-\epsilon)/dS(0)$. If the underlying asset is simulated using a log-Euler scheme
with time step $\epsilon$, then (7.29) does not introduce any error beyond that already
inherent to the discretization method.
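The smoothed estimator can be checked numerically in the geometric Brownian motion case. The following is a minimal sketch, assuming risk-neutral dynamics ($\mu = r$); the function names and parameter values are illustrative and not part of the text. It averages the chain-rule estimator above over simulated values of $S(T-\epsilon)$ and compares the result with the closed-form digital delta $e^{-rT}\phi(d_2)/(S(0)\sigma\sqrt{T})$.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: provides pdf and cdf


def smoothed_digital_delta(s0, k, r, sigma, t, eps, n_paths, seed=0):
    """Monte Carlo average of the smoothed delta estimator for a digital option.

    Assumes risk-neutral GBM (mu = r). Each replication simulates S(T - eps)
    and evaluates e^{-rT} * phi(d) / (S(0) * sigma * sqrt(eps)).
    """
    rng = random.Random(seed)
    tau = t - eps                      # time at which we condition
    nu = r - 0.5 * sigma ** 2          # drift of log S
    root_eps = math.sqrt(eps)
    scale = math.exp(-r * t) / (s0 * sigma * root_eps)
    total = 0.0
    for _ in range(n_paths):
        z = rng.gauss(0.0, 1.0)
        s_tau = s0 * math.exp(nu * tau + sigma * math.sqrt(tau) * z)
        d = (math.log(s_tau / k) + nu * eps) / (sigma * root_eps)
        total += scale * STD_NORMAL.pdf(d)
    return total / n_paths


def exact_digital_delta(s0, k, r, sigma, t):
    """Closed-form delta of e^{-rT} 1{S(T) > K} under GBM(r, sigma^2)."""
    d2 = (math.log(s0 / k) + (r - 0.5 * sigma ** 2) * t) / (sigma * math.sqrt(t))
    return math.exp(-r * t) * STD_NORMAL.pdf(d2) / (s0 * sigma * math.sqrt(t))
```

Because every replication returns a bounded, smooth quantity, the sample average settles quickly; calling `smoothed_digital_delta(100.0, 100.0, 0.05, 0.3, 1.0, 0.1, 50000)` should land close to `exact_digital_delta(100.0, 100.0, 0.05, 0.3, 1.0)`.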

Our second example of smoothing applies to barrier options. We again 
introduce it through geometric Brownian motion though we intend it as an 
approximation for more general processes. Consider a discretely monitored 
knock-out option with discounted payoff 


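$$Y = e^{-rT}(S(T) - K)^+\,\mathbf{1}\Big\{\min_{1\le i\le m} S(t_i) > b\Big\},$$

for some fixed dates $0 < t_1 < \cdots < t_m = T$ and barrier $b < S(0)$. The
knock-out feature makes $Y$ discontinuous in the path of the underlying asset.
For each $i = 0, 1, \ldots, m-1$ (with $t_0 = 0$), define the one-step survival
probability

$$p_i(x) = P(S(t_{i+1}) > b \mid S(t_i) = x) = \Phi\!\left(\frac{\log(x/b) + (\mu - \tfrac12\sigma^2)(t_{i+1} - t_i)}{\sigma\sqrt{t_{i+1} - t_i}}\right).$$

From $S$ construct a process $\tilde S$ having the same initial value as $S$ but with
state transitions generated conditional on survival: given $\tilde S(t_{i-1}) = x$, the
next state $\tilde S(t_i)$ has the distribution of $S(t_i)$ conditional on $S(t_{i-1}) = x$ and
$S(t_i) > b$. More explicitly,

$$\tilde S(t_i) = \tilde S(t_{i-1})\exp\Big((\mu - \tfrac12\sigma^2)(t_i - t_{i-1}) + \sigma\sqrt{t_i - t_{i-1}}\,Z_i\Big),$$

where

$$Z_i = \Phi^{-1}\Big(1 - p_{i-1}(\tilde S(t_{i-1})) + p_{i-1}(\tilde S(t_{i-1}))\,U_i\Big),$$

with $U_1, \ldots, U_m$ independent Unif[0,1] random variables. This mechanism
samples $Z_i$ from the normal distribution conditional on $\tilde S(t_i) > b$; see
Example 2.2.5.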


Using the conditioned process, we can write

$$E\Big[e^{-rT}(S(T) - K)^+\,\mathbf{1}\Big\{\min_{1\le i\le m} S(t_i) > b\Big\}\Big] = E\Big[e^{-rT}(\tilde S(T) - K)^+\prod_{i=0}^{m-1} p_i(\tilde S(t_i))\Big]. \tag{7.30}$$


Expressions of this form are derived in Glasserman and Staum [146] through
the observation that switching from $S$ to $\tilde S$ can be formulated as a change of
measure and that the product of the survival probabilities is the appropriate
likelihood ratio for this transformation. Because the conditioned process never
crosses the barrier, we can omit the indicator from the expression on the right.
The product of the survival probabilities smooths the indicator.
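To make the identity (7.30) concrete, here is a minimal sketch that prices a discretely monitored knock-out call both ways: with the conditioned process weighted by the product of survival probabilities, and with plain simulation using the indicator. It assumes risk-neutral dynamics ($\mu = r$); the function names are hypothetical, and the conditional normal is sampled by inverting the normal distribution function on the interval $[1 - p, 1]$.

```python
import math
import random
from statistics import NormalDist

N01 = NormalDist()


def survival_prob(x, b, mu, sigma, dt):
    """One-step probability that GBM started at x is above b after time dt."""
    return N01.cdf((math.log(x / b) + (mu - 0.5 * sigma ** 2) * dt)
                   / (sigma * math.sqrt(dt)))


def knockout_call_conditioned(s0, k, b, r, sigma, dates, n_paths, seed=0):
    """Right side of (7.30): conditioned paths weighted by survival probabilities."""
    rng = random.Random(seed)
    disc = math.exp(-r * dates[-1])
    total = 0.0
    for _ in range(n_paths):
        s, t_prev, weight = s0, 0.0, 1.0
        for t in dates:
            dt = t - t_prev
            p = survival_prob(s, b, r, sigma, dt)
            weight *= p
            # Uniform on [1 - p, 1], so Z is normal conditional on survival.
            q = 1.0 - p + p * rng.random()
            q = min(max(q, 1e-15), 1.0 - 1e-15)   # keep inv_cdf's argument in (0, 1)
            z = N01.inv_cdf(q)
            s *= math.exp((r - 0.5 * sigma ** 2) * dt + sigma * math.sqrt(dt) * z)
            t_prev = t
        total += disc * max(s - k, 0.0) * weight
    return total / n_paths


def knockout_call_plain(s0, k, b, r, sigma, dates, n_paths, seed=0):
    """Left side of (7.30): ordinary paths with the knock-out indicator."""
    rng = random.Random(seed)
    disc = math.exp(-r * dates[-1])
    total = 0.0
    for _ in range(n_paths):
        s, t_prev, alive = s0, 0.0, True
        for t in dates:
            dt = t - t_prev
            z = rng.gauss(0.0, 1.0)
            s *= math.exp((r - 0.5 * sigma ** 2) * dt + sigma * math.sqrt(dt) * z)
            if s <= b:
                alive = False
                break
            t_prev = t
        if alive:
            total += disc * max(s - k, 0.0)
    return total / n_paths
```

The two estimators target the same expectation, so with enough paths they should agree to within sampling error; the conditioned version never discards a path.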

We could now differentiate with respect to $S(0)$ inside the expectation
on the right side of (7.30) to get an estimator of the barrier option delta.
Though straightforward in principle, this is rather involved in practice. The
price we pay for switching from $S$ to $\tilde S$ is the loss of linearity in $S(0)$. Each
$Z_i$ depends on $\tilde S(t_{i-1})$ and thus contributes a term when we differentiate
$\tilde S(t_i)$. The practical value of this approach is therefore questionable, but it
nevertheless serves to illustrate the flexibility we have to modify a problem
before differentiating.


7.3 The Likelihood Ratio Method 


As explained in several places in Section 7.2, the scope of the pathwise method
is limited primarily by the requirement of continuity in the discounted payoff
as a function of the parameter of differentiation. The likelihood ratio method
provides an alternative approach to derivative estimation requiring no
smoothness at all in the discounted payoff and thus complementing the pathwise
method. It accomplishes this by differentiating probabilities rather than
payoffs.


7.3.1 Method and Examples 


As in Section 7.2.2, we consider a discounted payoff $Y$ expressed as a function $f$
of a random vector $X = (X_1, \ldots, X_m)$. The components of $X$ could represent
different underlying assets or values of a single asset at multiple dates. In our
discussion of the pathwise method, we assumed the existence of a functional
dependence of $X$ (and then $Y$) on a parameter $\theta$. In the likelihood ratio
method, we instead suppose that $X$ has a probability density $g$ and that $\theta$
is a parameter of this density. We therefore write $g_\theta$ for the density, and to
emphasize that an expectation is computed with respect to $g_\theta$, we sometimes
write it as $E_\theta$.

In this formulation, the expected discounted payoff is given by

$$E_\theta[Y] = E_\theta[f(X_1, \ldots, X_m)] = \int_{\mathbb{R}^m} f(x)\,g_\theta(x)\,dx.$$


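To derive a derivative estimator, we suppose that the order of differentiation
and integration can be interchanged to get

$$\frac{d}{d\theta}E_\theta[Y] = \int_{\mathbb{R}^m} f(x)\,\frac{d}{d\theta}g_\theta(x)\,dx. \tag{7.31}$$

If this indeed holds, then multiplying and dividing the integrand by $g_\theta$ yields

$$\frac{d}{d\theta}E_\theta[Y] = \int_{\mathbb{R}^m} f(x)\,\frac{\dot g_\theta(x)}{g_\theta(x)}\,g_\theta(x)\,dx = E_\theta\left[f(X)\,\frac{\dot g_\theta(X)}{g_\theta(X)}\right],$$

where we have written $\dot g_\theta$ for $dg_\theta/d\theta$. It now follows from this equation that
the expression

$$f(X)\,\frac{\dot g_\theta(X)}{g_\theta(X)} \tag{7.32}$$

is an unbiased estimator of the derivative of $E_\theta[Y]$. This is a likelihood ratio
method (LRM) estimator.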
Three issues merit comment: 


o As with the pathwise method, the validity of this approach relies on an
interchange of differentiation and integration. In practice, however, the
interchange in (7.31) is relatively benign in comparison to (7.16), because
probability densities are typically smooth functions of their parameters but
option payoffs are not. Whereas the interchange (7.16) imposes practical
limits on the use of the pathwise method, the validity of (7.31) is seldom
an obstacle to the use of the likelihood ratio method.

o Whether we view $\theta$ as a parameter of the path $X$ or of its density is largely a
matter of choice. Suppose, for example, that $X$ is a normal random variable
with distribution $N(\theta, 1)$ and $Y = f(X)$ for some function $f$. The density
of $X$ is $g_\theta(x) = \phi(x - \theta)$ with $\phi$ the standard normal density. By following
the steps above, we arrive at the estimator

$$-f(X)\,\frac{\phi'(X - \theta)}{\phi(X - \theta)} = f(X)(X - \theta)$$

for the derivative of $E_\theta[Y]$. But we could also write $X(\theta) = \theta + Z$ with
$Z \sim N(0, 1)$ and apply the pathwise method to get the estimator

$$Y'(\theta) = f'(X(\theta)) = f'(\theta + Z).$$

This illustrates the flexibility we have in representing the dependence on
a parameter through the path $X$ or its density (or both); this flexibility
should be exploited in developing derivative estimators.

o In the statistical literature, an expression of the form $\dot g_\theta/g_\theta$, often written
$d\log g_\theta/d\theta$, is called a score function. In the estimator (7.32), the score
function is evaluated at the outcome $X$. We refer to the random variable
$\dot g_\theta(X)/g_\theta(X)$ as the score. This term is short for "the differentiated log
density evaluated at the simulated outcome" and also for "the expression that
multiplies the discounted payoff in a likelihood ratio method estimator."
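The $N(\theta, 1)$ example in the second item is easy to try numerically. The sketch below is illustrative only: it takes $f(x) = x^2$ (an assumption made here so that the answer, $dE_\theta[X^2]/d\theta = 2\theta$, is known) and computes both the LRM estimator $f(X)(X - \theta)$ and the pathwise estimator $f'(\theta + Z)$ from the same normal draws.

```python
import random


def lrm_and_pathwise_normal(theta, n, seed=0):
    """Estimate d/dtheta E[X^2] for X ~ N(theta, 1) by LRM and by the pathwise method.

    f(x) = x^2 is an illustrative choice; the true derivative is 2 * theta.
    """
    rng = random.Random(seed)
    lrm_sum = 0.0
    pw_sum = 0.0
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)
        x = theta + z
        lrm_sum += (x * x) * (x - theta)   # f(X) times the score (X - theta)
        pw_sum += 2.0 * x                  # f'(theta + Z)
    return lrm_sum / n, pw_sum / n
```

Both averages converge to $2\theta$, but the LRM values are noticeably more dispersed in this instance, a pattern that recurs in the variance comparisons later in the section.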




We illustrate the likelihood ratio method through examples. This approach 
has been developed primarily in the discrete-event simulation literature; im- 
portant early references include Glynn [152], Reiman and Weiss [305], and 
Rubinstein [311]. It is sometimes called the score function method, as in Ru- 
binstein and Shapiro [313]. Broadie and Glasserman [64] and Glasserman and 
Zhao [150] develop applications in finance, and we include some of their ex- 
amples here. 


Example 7.3.1 Black-Scholes delta. To estimate the Black-Scholes delta using
the likelihood ratio method, we need to view $S(0)$ as a parameter of the
density of $S(T)$. Using (3.23), we find that the lognormal density of $S(T)$ is
given by

$$g(x) = \frac{1}{x\sigma\sqrt{T}}\,\phi(\zeta(x)), \qquad \zeta(x) = \frac{\log(x/S(0)) - (r - \tfrac12\sigma^2)T}{\sigma\sqrt{T}},$$

with $\phi$ the standard normal density. Some algebra now shows that

$$\frac{dg(x)/dS(0)}{g(x)} = -\zeta(x)\,\frac{d\zeta(x)}{dS(0)} = \frac{\log(x/S(0)) - (r - \tfrac12\sigma^2)T}{S(0)\sigma^2 T}.$$

We get the score by evaluating this expression at $S(T)$ and an unbiased
estimator of delta by multiplying by the discounted payoff of the option:

$$e^{-rT}(S(T) - K)^+\left(\frac{\log(S(T)/S(0)) - (r - \tfrac12\sigma^2)T}{S(0)\sigma^2 T}\right). \tag{7.33}$$

If $S(T)$ is generated from $S(0)$ using a standard normal random variable $Z$
as in (7.17), then $\zeta(S(T)) = Z$ and the estimator simplifies to

$$e^{-rT}(S(T) - K)^+\,\frac{Z}{S(0)\sigma\sqrt{T}}. \tag{7.34}$$

The form of the option payoff in this example is actually irrelevant; any
other function of $S(T)$ would result in an estimator of the same form. The
delta of a digital option, for example, can be estimated using

$$e^{-rT}\,\mathbf{1}\{S(T) > K\}\,\frac{Z}{S(0)\sigma\sqrt{T}}.$$

This is a general feature of the likelihood ratio method that contrasts markedly
with the pathwise method: the form of the estimator does not depend on
the details of the discounted payoff. Once the score is calculated it can be
multiplied by many different discounted payoffs to estimate their deltas.

For estimating vega, the score function is

$$\frac{dg(x)/d\sigma}{g(x)} = -\frac{1}{\sigma} - \zeta(x)\,\frac{d\zeta(x)}{d\sigma},$$

with

$$\frac{d\zeta(x)}{d\sigma} = \frac{\log(S(0)/x) + (r + \tfrac12\sigma^2)T}{\sigma^2\sqrt{T}}.$$

The derivative estimator is the product of the discounted payoff and the score
function evaluated at $x = S(T)$. After some algebraic simplification, the score
can be expressed as

$$\frac{Z^2 - 1}{\sigma} - Z\sqrt{T},$$

with $Z$ as in (7.17). □
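The two scores of this example translate directly into code. The following minimal sketch (function names are illustrative) multiplies the discounted call payoff by the delta score $Z/(S(0)\sigma\sqrt{T})$ and by the vega score $(Z^2-1)/\sigma - Z\sqrt{T}$, and provides the standard Black-Scholes delta $\Phi(d_1)$ and vega $S(0)\phi(d_1)\sqrt{T}$ for comparison.

```python
import math
import random
from statistics import NormalDist


def bs_lrm_delta_vega(s0, k, r, sigma, t, n_paths, seed=0):
    """LRM estimates of the Black-Scholes call delta and vega."""
    rng = random.Random(seed)
    disc = math.exp(-r * t)
    root_t = math.sqrt(t)
    delta_sum = 0.0
    vega_sum = 0.0
    for _ in range(n_paths):
        z = rng.gauss(0.0, 1.0)
        s_t = s0 * math.exp((r - 0.5 * sigma ** 2) * t + sigma * root_t * z)
        payoff = disc * max(s_t - k, 0.0)
        delta_sum += payoff * z / (s0 * sigma * root_t)          # estimator (7.34)
        vega_sum += payoff * ((z * z - 1.0) / sigma - z * root_t)  # payoff x vega score
    return delta_sum / n_paths, vega_sum / n_paths


def bs_exact_delta_vega(s0, k, r, sigma, t):
    """Closed-form Black-Scholes call delta and vega, for comparison."""
    n01 = NormalDist()
    d1 = (math.log(s0 / k) + (r + 0.5 * sigma ** 2) * t) / (sigma * math.sqrt(t))
    return n01.cdf(d1), s0 * n01.pdf(d1) * math.sqrt(t)
```

Swapping `max(s_t - k, 0.0)` for any other function of `s_t` leaves the two score expressions untouched, which is the payoff-independence noted in the example.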


Example 7.3.2 Path-dependent deltas. Consider an Asian option, as in
Example 7.2.2. The payoff is a function of $S(t_1), \ldots, S(t_m)$, so we need the density
of this path. Using the Markov property of geometric Brownian motion, we
can factor this density as

$$g(x_1, \ldots, x_m) = g_1(x_1 \mid S(0))\,g_2(x_2 \mid x_1)\cdots g_m(x_m \mid x_{m-1}),$$

where each $g_j(x_j \mid x_{j-1})$, $j = 1, \ldots, m$, is the transition density from time $t_{j-1}$
to time $t_j$,

$$g_j(x_j \mid x_{j-1}) = \frac{1}{x_j\sigma\sqrt{t_j - t_{j-1}}}\,\phi(\zeta_j(x_j \mid x_{j-1})),$$

with

$$\zeta_j(x_j \mid x_{j-1}) = \frac{\log(x_j/x_{j-1}) - (r - \tfrac12\sigma^2)(t_j - t_{j-1})}{\sigma\sqrt{t_j - t_{j-1}}}$$

and $t_0 = 0$, $x_0 = S(0)$. It follows that $S(0)$ is a parameter of the first factor
$g_1(x_1 \mid S(0))$ but does not appear in any of the other factors. The score is thus
given by

$$\frac{\partial\log g(S(t_1), \ldots, S(t_m))}{\partial S(0)} = \frac{\partial\log g_1(S(t_1) \mid S(0))}{\partial S(0)} = \frac{\zeta_1(S(t_1) \mid S(0))}{S(0)\sigma\sqrt{t_1}}.$$

This can also be written as

$$\frac{Z_1}{S(0)\sigma\sqrt{t_1}}, \tag{7.35}$$

with $Z_1$ the normal random variable used to generate $S(t_1)$ from $S(0)$, as in
(7.21). The likelihood ratio method estimator of the Asian option delta is thus

$$e^{-rT}(\bar S - K)^+\,\frac{Z_1}{S(0)\sigma\sqrt{t_1}}. \tag{7.36}$$

Again, the specific form of the discounted payoff is irrelevant; the same
derivation applies to any function of the path of $S$ over the interval $[t_1, T]$,
including a discretely monitored barrier option.

Observe that the score has mean zero for all nonzero values of $S(0)$, $\sigma$,
and $t_1$, but its variance increases without bound as $t_1 \downarrow 0$. We discuss this
phenomenon in Section 7.3.2. □
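As a rough numerical check of (7.36), the sketch below estimates the delta of a discretely monitored Asian call two ways on the same paths: with the LRM score $Z_1/(S(0)\sigma\sqrt{t_1})$, and with the standard pathwise estimator $e^{-rT}\mathbf{1}\{\bar S > K\}\,\bar S/S(0)$, which is quoted here as a known result rather than derived in this passage. Function and variable names are illustrative.

```python
import math
import random


def asian_delta_lrm_and_pathwise(s0, k, r, sigma, dates, n_paths, seed=0):
    """Asian call delta by the LRM estimator (7.36) and by the pathwise method."""
    rng = random.Random(seed)
    disc = math.exp(-r * dates[-1])
    m = len(dates)
    t1 = dates[0]
    lrm_sum = 0.0
    pw_sum = 0.0
    for _ in range(n_paths):
        s, t_prev, running, z1 = s0, 0.0, 0.0, 0.0
        for j, t in enumerate(dates):
            dt = t - t_prev
            z = rng.gauss(0.0, 1.0)
            if j == 0:
                z1 = z                     # only the first step depends on S(0)
            s *= math.exp((r - 0.5 * sigma ** 2) * dt + sigma * math.sqrt(dt) * z)
            running += s
            t_prev = t
        s_bar = running / m
        payoff = disc * max(s_bar - k, 0.0)
        lrm_sum += payoff * z1 / (s0 * sigma * math.sqrt(t1))
        if s_bar > k:
            pw_sum += disc * s_bar / s0
    return lrm_sum / n_paths, pw_sum / n_paths
```

Shrinking the first date `dates[0]` while holding everything else fixed inflates the LRM score by $1/\sqrt{t_1}$, which is one way to observe the variance growth mentioned at the end of the example.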


Example 7.3.3 Path-dependent vega. Whereas in the previous example the
parameter $S(0)$ appeared only in the distribution of $S(t_1)$, the parameter $\sigma$
influences every state transition. Accordingly, each transition along the path
contributes a term to the score. Following steps similar to those in the previous
two examples, we find that the score

$$\frac{\partial\log g(S(t_1), \ldots, S(t_m))}{\partial\sigma} = \sum_{j=1}^{m}\frac{\partial\log g_j(S(t_j) \mid S(t_{j-1}))}{\partial\sigma}$$

is given by

$$-\sum_{j=1}^{m}\left(\frac{1}{\sigma} + \zeta_j(S(t_j) \mid S(t_{j-1}))\,\frac{\partial\zeta_j}{\partial\sigma}\right).$$

This can be written as

$$\sum_{j=1}^{m}\left(\frac{Z_j^2 - 1}{\sigma} - Z_j\sqrt{t_j - t_{j-1}}\right), \tag{7.37}$$

using the normal random variables $Z_j$ in (7.21). □


Example 7.3.4 Gaussian vectors and options on multiple assets. The problem
of LRM delta estimation for an option on multivariate geometric Brownian
motion can be reduced to one of differentiation with respect to a parameter of
the mean of a normal random vector. Suppose therefore that $X \sim N(\mu(\theta), \Sigma)$
with $\theta$ a scalar parameter of the $d$-dimensional mean vector $\mu$, and $\Sigma$ a $d \times d$
covariance matrix of full rank. Let $g = g_\theta$ denote the multivariate normal
density of $X$, as in (2.21). Differentiation reveals that the score is

$$\frac{d}{d\theta}\log g_\theta(X) = (X - \mu(\theta))^\top\Sigma^{-1}\dot\mu(\theta),$$

with $\dot\mu(\theta)$ the vector of derivatives of the components of $\mu$ with respect to the
parameter $\theta$. If we simulate $X$ as $\mu(\theta) + AZ$ with $Z \sim N(0, I)$ for some matrix
$A$ satisfying $AA^\top = \Sigma$, then the score simplifies to

$$\frac{d}{d\theta}\log g_\theta(X) = Z^\top A^{-1}\dot\mu(\theta). \tag{7.38}$$

If $\theta$ is a parameter of $\Sigma$ rather than $\mu$, the score is

$$\frac{d\log g_\theta(X)}{d\theta} = -\tfrac12\,\mathrm{tr}\big(\Sigma^{-1}(\theta)\dot\Sigma(\theta)\big) + \tfrac12 (X - \mu)^\top\Sigma^{-1}(\theta)\dot\Sigma(\theta)\Sigma^{-1}(\theta)(X - \mu) \tag{7.39}$$

$$= -\tfrac12\,\mathrm{tr}\big(A^{-1}(\theta)\dot\Sigma(\theta)A^{-1}(\theta)^\top\big) + \tfrac12 Z^\top A^{-1}(\theta)\dot\Sigma(\theta)A^{-1}(\theta)^\top Z, \tag{7.40}$$

where "tr" denotes trace, $A(\theta)A(\theta)^\top = \Sigma(\theta)$, and $\dot\Sigma(\theta)$ is the matrix of
derivatives of the elements of $\Sigma(\theta)$.

Now consider a $d$-dimensional geometric Brownian motion

$$\frac{dS_i(t)}{S_i(t)} = r\,dt + v_i^\top dW(t), \qquad i = 1, \ldots, d,$$

with $W$ a $d$-dimensional standard Brownian motion. Let $A$ be the $d \times d$ matrix
with rows $v_1^\top, \ldots, v_d^\top$ and suppose that $A$ has full rank. Let $\Sigma = AA^\top$. Then
$S_i(T)$ has the distribution of $\exp(X_i)$, $i = 1, \ldots, d$, with

$$X \sim N(\mu, T\Sigma), \qquad \mu_i = \log S_i(0) + (r - \tfrac12\|v_i\|^2)T.$$

We may view any discounted payoff $f(S_1(T), \ldots, S_d(T))$ as a function of the
vector $X$. To calculate the delta with respect to the $i$th underlying asset, we
take $S_i(0)$ as a parameter of the mean vector $\mu$. The resulting likelihood ratio
method estimator is

$$f(S_1(T), \ldots, S_d(T))\cdot\frac{(Z^\top A^{-1})_i}{S_i(0)\sqrt{T}},$$

the numerator of the score given by the $i$th component of the vector $Z^\top A^{-1}$. □

Example 7.3.5 Square-root diffusion. We use the notation of Example 7.2.5.
We know from Section 3.4.1 that $X(t)/c_1$ has a noncentral chi-square density
and that $X(0)$ appears in the noncentrality parameter. To estimate the
sensitivity of some $E[f(X(t))]$ with respect to $X(0)$, one could therefore calculate
a score function by differentiating the log density (available from (3.67)) with
respect to the noncentrality parameter. But the noncentral chi-square density
is cumbersome and its score function still more cumbersome, so this is not a
practical solution.

An alternative uses the decomposition (7.22). Any function of $X(t)$ can be
represented as a function of a normal random variable with mean $\sqrt{c_2 X(0)}$
and an independent chi-square random variable. The expected value of the
function is then an integral with respect to the bivariate density of these two
random variables. Because the variables are independent, their joint density is
just the product of their marginal normal and chi-square densities. Only the
first of these densities depends on $X(0)$, and $X(0)$ enters through the mean.
The score function thus reduces to

$$\frac{d}{dX(0)}\log\phi\big(x - \sqrt{c_2 X(0)}\big) = \frac{x\sqrt{c_2} - c_2\sqrt{X(0)}}{2\sqrt{X(0)}}.$$

Evaluating this at the normal random variable $Z + \sqrt{c_2 X(0)}$ (with $Z$ as in
(7.22)), we get the score

$$\frac{Z\sqrt{c_2}}{2\sqrt{X(0)}}.$$

Multiplying this by $f(X(t))$ produces an LRM estimator of the derivative
$dE[f(X(t))]/dX(0)$. For the derivative of a function of a discrete path $X(t_1)$,
..., $X(t_m)$, replace $Z$ with $Z_1$ in the score, with $Z_1$ the normal random
variable used in generating $X(t_1)$ from $X(0)$. □
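A small experiment illustrates this score. The sketch below assumes the decomposition takes the form $X(t) = c_1\big[(Z + \sqrt{c_2 X(0)})^2 + \chi^2_{\nu-1}\big]$ with $\nu > 1$, which matches the description above but is not reproduced in this passage; $c_1$, $c_2$ and $\nu$ are treated as given constants, and the function name is hypothetical. With $f(x) = x$ the exact derivative is $c_1 c_2$, which provides a check.

```python
import math
import random


def sqrt_diffusion_lrm(f, x0, c1, c2, nu, n_paths, seed=0):
    """LRM estimate of dE[f(X(t))]/dX(0) for a square-root diffusion.

    Assumes X(t) = c1 * ((Z + sqrt(c2*x0))**2 + chi2), with chi2 an independent
    chi-square variable with nu - 1 degrees of freedom (requires nu > 1).
    """
    rng = random.Random(seed)
    shift = math.sqrt(c2 * x0)
    score_scale = math.sqrt(c2) / (2.0 * math.sqrt(x0))
    total = 0.0
    for _ in range(n_paths):
        z = rng.gauss(0.0, 1.0)
        # chi-square with nu - 1 d.o.f. is Gamma(shape=(nu-1)/2, scale=2)
        chi2 = rng.gammavariate(0.5 * (nu - 1.0), 2.0)
        x_t = c1 * ((z + shift) ** 2 + chi2)
        total += f(x_t) * z * score_scale
    return total / n_paths
```

For example, `sqrt_diffusion_lrm(lambda x: x, 1.0, 0.5, 2.0, 3.0, 200000)` should return a value close to $c_1 c_2 = 1$.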


7.3.2 Bias and Variance Properties 


The likelihood ratio method produces an unbiased estimator of a derivative
when (7.31) holds; i.e., when the integral (over $x$) of the limit as $h \to 0$ of the
functions

$$f(x)\,\frac{1}{h}\left(\frac{g_{\theta+h}(x)}{g_\theta(x)} - 1\right)g_\theta(x) \tag{7.41}$$

equals the limit of their integrals. Because probability densities tend to be
smooth functions of their parameters, this condition is widely satisfied.
Specific conditions for exponential families of distributions are given in, e.g.,
Barndorff-Nielsen [35], and this case is relevant to several of our examples.
Glynn and L'Ecuyer [158] provide general conditions. In practice, the
applicability of the likelihood ratio method is more often limited by either (i)
the need for explicit knowledge of a density, or (ii) a large variance, than by
the failure of (7.31).


Absolute Continuity 


As suggested by (7.41), the likelihood ratio method is based on a limit of
importance sampling estimators (see Section 4.6). For fixed $h$, the integral
of (7.41) equals $1/h$ times the difference $E_{\theta+h}[f(X)] - E_\theta[f(X)]$, the first
expectation relying on the importance sampling identity

$$E_{\theta+h}[f(X)] = E_\theta\big[f(X)\,g_{\theta+h}(X)/g_\theta(X)\big].$$

The validity of this identity relies on an implicit assumption of absolute
continuity (cf. Appendix B.4): we need $g_\theta(x) > 0$ at all points $x$ for which
$g_{\theta+h}(x) > 0$.

For a simple example in which absolute continuity fails, suppose $X$ is
uniformly distributed over $(0, \theta)$. Its density is

$$g_\theta(x) = \frac{1}{\theta}\,\mathbf{1}\{0 < x < \theta\},$$

which is differentiable in $\theta$ at $x \in (0, \theta)$. The score $d\log g_\theta(X)/d\theta$ exists with
probability 1 and equals $-1/\theta$. The resulting LRM estimator of the derivative
of $E_\theta[X]$ has expectation

$$E_\theta\Big[X\cdot\Big(-\frac{1}{\theta}\Big)\Big] = -\frac{\theta/2}{\theta} = -\frac12,$$

whereas the correct value is

$$\frac{d}{d\theta}E_\theta[X] = \frac{d}{d\theta}\,\frac{\theta}{2} = \frac12.$$

The estimator even fails to predict the direction of change. This failure results
from the fact that $g_{\theta+h}$ is not absolutely continuous with respect to $g_\theta$.
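The failure in the uniform example is easy to reproduce. This short sketch (names are illustrative) averages the LRM estimator $X\cdot(-1/\theta)$ for $X$ uniform on $(0, \theta)$ and returns it alongside the true derivative of $E_\theta[X] = \theta/2$.

```python
import random


def uniform_lrm_vs_truth(theta, n, seed=0):
    """Average the (invalid) LRM estimator of d/dtheta E[X], X ~ Unif(0, theta).

    Returns the LRM average and the true derivative, which is 1/2.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = theta * rng.random()
        total += x * (-1.0 / theta)    # payoff X times the score -1/theta
    true_derivative = 0.5              # d/dtheta of theta/2
    return total / n, true_derivative
```

The average settles near $-1/2$ regardless of $\theta$: the estimator is precise, but it is precisely wrong, because the support of the density moves with the parameter.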

A related (and in practice more significant) limitation applies to the
multivariate normal distribution. Suppose $X \sim N(\mu, \Sigma)$ and we are interested in
the sensitivity of some expectation $E[f(X)]$ to changes in $\theta$, with $\theta$ a
parameter of $\mu$ or $\Sigma$. If $X$ is a vector of length $d$ and the $d \times d$ matrix $\Sigma$ has rank
$k < d$, then $X$ fails to have a density on $\mathbb{R}^d$ and the likelihood ratio method
is not directly applicable. As discussed in Section 2.3.3, in the rank-deficient
case we could express $d - k$ components of $X$ as linear transformations of $\tilde X$,
a vector consisting of the other $k$ components of $X$. Moreover, $\tilde X$ would then
have a density in $\mathbb{R}^k$. This suggests that we could write $f(X) = \tilde f(\tilde X)$ for
some function $\tilde f$ and then apply the likelihood ratio method.

This transformation may not, however, entirely remove the obstacle to
using the method if it introduces explicit dependence on $\theta$ in the function $\tilde f$.
A simple example illustrates this point. Suppose $Z_1$ and $Z_2$ are independent
normal random variables and

$$\begin{pmatrix} X_1 \\ X_2 \\ X_3 \end{pmatrix} = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \end{pmatrix} + \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ a_1 & a_2 \end{pmatrix}\begin{pmatrix} Z_1 \\ Z_2 \end{pmatrix}. \tag{7.42}$$

We can reduce any function $f$ of $(X_1, X_2, X_3)$ to a function of just the first
two components by defining

$$\tilde f(X_1, X_2) = f\big(X_1, X_2, a_1(X_1 - \mu_1) + a_2(X_2 - \mu_2) + \mu_3\big).$$

The pair $(X_1, X_2)$ has a probability density in $\mathbb{R}^2$. But if any of the $\mu_i$ or $a_i$
depend on $\theta$ then $\tilde f$ does too. This dependence is not captured by
differentiating the density of $(X_1, X_2)$; capturing it requires differentiating $\tilde f$, in effect
combining the likelihood ratio and pathwise methods. This may not be
possible if $f$ is discontinuous. The inability of the likelihood ratio method to handle
this example again results from a failure of absolute continuity: $(X_1, X_2, X_3)$
has a density on a two-dimensional subspace of $\mathbb{R}^3$, but this subspace changes
with $\theta$, so the densities do not have common support.


Variance Properties 


A near failure of absolute continuity is usually accompanied by an explosion
in variance. This results from properties of the score closely related to the
variance build-up in likelihood ratios discussed in Section 4.6.1 and
Appendix B.4.

In (7.42), one could get around the lack of a density by adding $\sqrt{\epsilon}Z_3$ to
$X_3$, with $Z_3$ independent of $Z_1$, $Z_2$. This would lead to a score of the form in
(7.38) with

$$A^{-1} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ -a_1/\sqrt{\epsilon} & -a_2/\sqrt{\epsilon} & 1/\sqrt{\epsilon} \end{pmatrix}$$

and thus infinite variance as $\epsilon \to 0$.

Something similar occurs with the score (7.35) for a path-dependent option
sensitive to the values of $\mathrm{GBM}(r, \sigma^2)$ at dates $t_1, \ldots, t_m$. The score has
expected value 0 for all $t_1 > 0$, but its variance grows without bound as $t_1$
approaches 0. This, too, is a consequence of a breakdown of absolute
continuity. The score in (7.35) results from viewing a value of $S(t_1)$ generated
from $S(0)$ as though it had been generated from $S(0) + h$. For $t_1 > 0$, all
positive values of $S(t_1)$ are possible starting from both $S(0)$ and $S(0) + h$; the
two distributions of $S(t_1)$ are mutually absolutely continuous and thus have
a well-defined likelihood ratio. But viewed as probability measures on paths
starting at time 0 rather than $t_1$, the measures defined by $S(0)$ and $S(0) + h$ are
mutually singular: no path that starts at one value can also start at the other.
Absolute continuity thus fails as $t_1 \to 0$, and this is manifested by the increase
in variance of the score.

Next we consider the effect of a long time horizon. The score in (7.37) for
the vega of a function of the discrete path $(S(t_1), \ldots, S(t_m))$ is a sum of $m$
independent random variables. Its variance grows linearly in $m$ if the spacings
$t_j - t_{j-1}$ are constant, faster if the spacings themselves are increasing. This
suggests that the variance of an LRM estimator that uses this score will
typically increase with the number of dates $m$. The same occurs if we fix a
time horizon $T$, divide the interval $[0, T]$ into $m$ equal time steps, and then let
$m \to \infty$. Recall (cf. the discussion of Girsanov's Theorem in Appendix B.4)
that the measures on paths for Brownian motions with different variance
parameters are mutually singular, so this explosion in variance once again
results from a breakdown of absolute continuity.
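The linear growth can be seen directly from (7.37). Each term $(Z_j^2 - 1)/\sigma - Z_j\sqrt{\Delta t}$ has variance $2/\sigma^2 + \Delta t$, because $Z_j^2 - 1$ and $Z_j$ are uncorrelated, so with constant spacing $\Delta t$ the score has variance $m(2/\sigma^2 + \Delta t)$. The following sketch (illustrative names) compares a sample variance with that formula.

```python
import math
import random


def vega_score(z_list, sigma, dt):
    """Score (7.37) for equally spaced dates with spacing dt."""
    return sum((z * z - 1.0) / sigma - z * math.sqrt(dt) for z in z_list)


def score_variance(m, sigma, dt, n_paths, seed=0):
    """Sample variance of the vega score over n_paths simulated paths."""
    rng = random.Random(seed)
    s1 = 0.0
    s2 = 0.0
    for _ in range(n_paths):
        sc = vega_score([rng.gauss(0.0, 1.0) for _ in range(m)], sigma, dt)
        s1 += sc
        s2 += sc * sc
    mean = s1 / n_paths
    return s2 / n_paths - mean * mean


def theoretical_variance(m, sigma, dt):
    """Variance of the sum of m independent score terms: m * (2/sigma^2 + dt)."""
    return m * (2.0 / sigma ** 2 + dt)
```

With $\sigma = 0.30$ the term $2/\sigma^2 \approx 22$ dominates, so the score variance is roughly $22m$ for weekly dates, before it is multiplied by any payoff.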

Figure 7.3 illustrates the growth in variance in estimating vega for an Asian
option. The model parameters are $S(0) = K = 100$, $\sigma = 0.30$, $r = 5\%$; the
option payoff is based on the average level of the underlying asset over $m$
equally spaced dates with a spacing of 1/52 (one week). The value of $m$ varies
along the horizontal axis; the vertical axis shows variance (per replication)
on a log scale, as estimated from one million replications. The figure shows
that the variance grows with $m$ for both the LRM and pathwise estimators.
The variance of the LRM estimator starts higher (by more than a factor of
10 at $m = 2$) and grows faster as $m$ increases.

Fig. 7.3. Variance of vega estimators for an Asian option with weekly averaging as
a function of the number of weeks in the average.

The growth in variance exhibited by the vega score (7.37) is intrinsic to the
method: subject only to modest technical conditions, the score is a martingale
and thus a process of increasing variance. Consider, for example, a simulation
(as in (4.78)) based on a recursion of the form

$$S(t_{i+1}) = G(S(t_i), X_{i+1}) \tag{7.43}$$

driven by i.i.d. random vectors $X_1, X_2, \ldots$. Suppose $\theta$ is a parameter of the
density $g_\theta$ of the $X_i$. A likelihood ratio method estimator of the derivative
with respect to $\theta$ of some expectation $E[f(S(t_1), \ldots, S(t_m))]$ uses the score

$$\sum_{i=1}^{m}\frac{\dot g_\theta(X_i)}{g_\theta(X_i)}.$$

This is an i.i.d. sum and a martingale because

$$E_\theta\left[\frac{\dot g_\theta(X_i)}{g_\theta(X_i)}\right] = \int\frac{\dot g_\theta(x)}{g_\theta(x)}\,g_\theta(x)\,dx = \frac{d}{d\theta}\int g_\theta(x)\,dx = 0,$$

using the fact that $g_\theta$ integrates to 1 for all $\theta$. Alternatively, if $g_\theta(y \mid x)$ is the
transition density of the process $S(t_i)$, $i = 1, 2, \ldots$, we could use the score

$$\sum_{i=1}^{m}\frac{\dot g_\theta(S(t_i) \mid S(t_{i-1}))}{g_\theta(S(t_i) \mid S(t_{i-1}))}.$$

This is no longer an i.i.d. sum, but a similar argument shows that it is still a
martingale and hence that its variance increases with $m$.

Because the score has mean zero, it provides a candidate control variate.
Using it as a control removes some of the variance introduced by the score,
but rarely changes the qualitative dependence on the number of steps $m$.


7.3.3 Gamma


We noted in Example 7.2.6 that the pathwise method does not readily extend
to estimating second derivatives. With the likelihood ratio method, second
derivatives are in principle no more difficult to estimate than first derivatives.
Let $\ddot g_\theta$ denote the second derivative of $g_\theta$ in $\theta$. The argument leading to (7.32)
shows that

$$f(X)\,\frac{\ddot g_\theta(X)}{g_\theta(X)} \tag{7.44}$$

is an unbiased estimator of $d^2E_\theta[f(X)]/d\theta^2$, subject to conditions permitting
two interchanges of derivative and expectation. These conditions seldom limit
the applicability of the method. However, just as the score $\dot g_\theta(X)/g_\theta(X)$ can
lead to large variance in first-derivative estimates, its counterpart in (7.44)
often produces even larger variance in second-derivative estimates.

We illustrate the method with some examples. 


Example 7.3.6 Black-Scholes gamma. We apply (7.44) to estimate gamma
(the second derivative with respect to the underlying price $S(0)$) in the
Black-Scholes setting. Using the notation of Example 7.3.1, we find that

$$\frac{d^2 g(S(T))/dS(0)^2}{g(S(T))} = \frac{\zeta(S(T))^2 - 1}{S(0)^2\sigma^2 T} - \frac{\zeta(S(T))}{S(0)^2\sigma\sqrt{T}}. \tag{7.45}$$

Multiplying this expression by the discounted payoff $e^{-rT}(S(T) - K)^+$
produces an unbiased estimator of the Black-Scholes gamma. If $S(T)$ is generated
from $S(0)$ using (7.17), $\zeta(S(T))$ can be replaced with the standard normal
random variable $Z$.

As with LRM first-derivative estimates, the form of the discounted payoff
is unimportant; (7.45) could be applied with other functions of $S(T)$. Much as
in Example 7.3.2, this expression also extends to path-dependent options. For
a function of $S(t_1), \ldots, S(t_m)$, replace $T$ with $t_1$ in (7.45); see the argument
leading to (7.35). □


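Example 7.3.7 Second derivatives for Gaussian vectors. If $X \sim N(\mu, \sigma^2)$
then differentiation with respect to $\mu$ yields

$$\frac{\dot g_\mu(X)}{g_\mu(X)} = \frac{X - \mu}{\sigma^2}$$

and

$$\frac{\ddot g_\mu(X)}{g_\mu(X)} = \frac{(X - \mu)^2 - \sigma^2}{\sigma^4}.$$

Higher orders of differentiation produce higher powers of $X$ and often larger
variance.

More generally, let $X$ be a random vector with multivariate normal
distribution $N(\mu(\theta), \Sigma)$, with $\theta$ a scalar parameter and $\Sigma$ of full rank. Then

$$\frac{\ddot g_\theta(X)}{g_\theta(X)} = \big([X - \mu(\theta)]^\top\Sigma^{-1}\dot\mu(\theta)\big)^2 - \dot\mu(\theta)^\top\Sigma^{-1}\dot\mu(\theta) + [X - \mu(\theta)]^\top\Sigma^{-1}\ddot\mu(\theta),$$

where $\dot\mu$ and $\ddot\mu$ denote derivatives of $\mu$ with respect to $\theta$. □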


An alternative approach to estimating second derivatives combines the
pathwise and likelihood ratio methods, using each for one order of
differentiation. We illustrate this idea by applying it to the Black-Scholes gamma.
Suppose we apply the likelihood ratio method first and then the pathwise
method. Multiplying the discounted payoff by the score yields (7.34), which
we differentiate to get the "LR-PW" estimator

$$\frac{d}{dS(0)}\left(e^{-rT}(S(T) - K)^+\,\frac{Z}{S(0)\sigma\sqrt{T}}\right) = e^{-rT}\,\frac{Z}{S(0)^2\sigma\sqrt{T}}\,\mathbf{1}\{S(T) > K\}\,K, \tag{7.46}$$

using $dS(T)/dS(0) = S(T)/S(0)$. If instead we apply pathwise differentiation
first, we get the delta estimator (7.20). This expression has a functional
dependence on $S(0)$ as well as a distributional dependence on $S(0)$ through the
density of $S(T)$. Multiplying by the score captures the second dependence
but for the first we need to take another pathwise derivative. The resulting
"PW-LR" estimator is

$$e^{-rT}\,\mathbf{1}\{S(T) > K\}\left(\frac{S(T)}{S(0)^2}\right)\left(\frac{Z}{\sigma\sqrt{T}} - 1\right).$$

The pure likelihood ratio method estimator of gamma is

$$e^{-rT}(S(T) - K)^+\left(\frac{Z^2 - 1}{S(0)^2\sigma^2 T} - \frac{Z}{S(0)^2\sigma\sqrt{T}}\right),$$

using (7.45).
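The three gamma estimators differ only in the weight applied to each simulated $S(T)$, so they are convenient to compare on common random numbers. The sketch below is illustrative (the function names and the dictionary keys are choices made here); it also includes the closed-form gamma $\phi(d_1)/(S(0)\sigma\sqrt{T})$ as a reference.

```python
import math
import random
from statistics import NormalDist


def bs_gamma_estimators(s0, k, r, sigma, t, n_paths, seed=0):
    """Pure LR, PW-LR and LR-PW estimates of the Black-Scholes call gamma."""
    rng = random.Random(seed)
    disc = math.exp(-r * t)
    vol = sigma * math.sqrt(t)
    lr = 0.0
    pw_lr = 0.0
    lr_pw = 0.0
    for _ in range(n_paths):
        z = rng.gauss(0.0, 1.0)
        s_t = s0 * math.exp((r - 0.5 * sigma ** 2) * t + vol * z)
        # Pure likelihood ratio: discounted payoff times the weight in (7.45).
        lr += disc * max(s_t - k, 0.0) * ((z * z - 1.0) / (s0 * s0 * vol * vol)
                                          - z / (s0 * s0 * vol))
        if s_t > k:
            # Pathwise delta, then score plus derivative of the explicit 1/S(0).
            pw_lr += disc * (s_t / (s0 * s0)) * (z / vol - 1.0)
            # LRM delta (7.34), then pathwise derivative: estimator (7.46).
            lr_pw += disc * k * z / (s0 * s0 * vol)
    return {"LR": lr / n_paths, "PW-LR": pw_lr / n_paths, "LR-PW": lr_pw / n_paths}


def bs_exact_gamma(s0, k, r, sigma, t):
    """Closed-form Black-Scholes gamma, for comparison."""
    d1 = (math.log(s0 / k) + (r + 0.5 * sigma ** 2) * t) / (sigma * math.sqrt(t))
    return NormalDist().pdf(d1) / (s0 * sigma * math.sqrt(t))
```

Accumulating squared values alongside the sums would turn this into a per-replication variance comparison of the three estimators.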

Table 7.2 compares the variance per replication using these methods with
parameters $S(0) = 100$, $\sigma = 0.30$, $r = 5\%$, three levels of the strike $K$, and
two maturities $T$. The table also compares variances for the likelihood ratio
method and pathwise estimators of delta; the exact values of delta and gamma
are included in the table for reference. Each variance estimate is based on 1
million replications.

The results in the table indicate that the pathwise estimator of delta has
lower variance than the likelihood ratio method estimator, especially at higher
values of $S(0)/K$. The results also indicate that the mixed gamma estimators
have substantially lower variance than the pure likelihood ratio method
estimator. The two mixed estimators have similar variances, except at lower
values of $K$ where LR-PW shows a distinct advantage. This advantage is also
evident from the expression in (7.46).


                    T = 0.1                      T = 0.5
K            90       100      110        90       100      110

Delta        0.887    0.540    0.183      0.764    0.589    0.411
  LR         3.4      1.5      0.5        2.¢      2.0      1.4
  PW         0.1      0.3      0.2        0.3      0.4      0.3
Gamma        0.020    0.042    0.028      0.015    0.018    0.018
  LR         0.1232   0.0625   0.0265     0.0202   0.0154   0.0116
  PW-LR      0.0077   0.0047   0.0048     0.0015   0.0013   0.0013
  LR-PW      0.0052   0.0037   0.0045     0.0007   0.0007   0.0009


Table 7.2. Variance comparison for estimators of Black-Scholes deltas and gammas
using the likelihood ratio (LR) and pathwise (PW) methods and combinations of
the two.


Although generalizing from numerical results is risky (especially with
results based on such a simple example), we expect the superior performance of
the mixed gamma estimators compared with the pure likelihood ratio method
estimator to hold more generally. It holds, for example, in the more
complicated application to the LIBOR market model in Glasserman and Zhao
[150].

Yet another strategy for estimating second derivatives applies a
finite-difference approximation using two estimates of first derivatives. Glasserman
and Zhao [150] report examples in which the best estimator of this type has
smaller root mean square error than a mixed PW-LR estimator. A difficulty
in this method, as in any use of finite differences, is finding an effective choice
of parameter increment $h$.


7.3.4 Approximations and Related Methods 


The main limitations on the use of the likelihood ratio method are (i) the 
method’s reliance on explicit probability densities, and (ii) the large variance 
sometimes produced by the score. This section describes approximations and 
related methods that can in some cases address these limitations. 


General Diffusion Processes 


Several of the examples with which we introduced the likelihood ratio method
in Section 7.3.1 rely on the simplicity of probability densities associated with
geometric Brownian motion. In more complicated models, we rarely have
explicit expressions for either marginal or transition densities, even in cases
where these are known to exist. By the same token, in working with more
complicated models we are often compelled to simulate approximations (using,
e.g., the discretization methods of Chapter 6), and we may be able to
develop derivative estimators for the approximating processes even if we
cannot for the original process.

Consider, for example, a general diffusion of the type in (7.23). An Euler 
approximation to the process is a special case of the general recursion in 
(7.43) in which the driving random variables are normally distributed. In fact, 
whereas we seldom know the transition law of a general diffusion (especially 
in multiple dimensions), the Euler approximation has a Gaussian transition 
law and thus lends itself to use of the likelihood ratio method. If the method 
is unbiased for the Euler approximation, it does not introduce any discretiza- 
tion error beyond that already present in the Euler scheme. (Similar though 
more complicated techniques can potentially be developed using higher-order 
discretization methods; for example, in the second-order scheme (6.36), the 
state transition can be represented as a linear transformation of a noncentral 
chi-square random variable.) 

We illustrate this idea through the Heston [179] stochastic volatility model, 

using the notation of Example 6.2.2. We consider an Euler approximation 


A 


S(i+1)=(1+rh)$(i) + SO V (hZ (i+ 1) 
a E E e E (oz GAD Sei eBos+ 1)) 


with time step h and independent standard normal variables (Z: (i), Z2(2)), 
t = 1,2,.... The conditional distribution at step i + 1 given the state at the 


ith step is normal, 


(1 + rh)S$(i) S2(i)V(i)h pos 
a (ee + «(0 — A i eo o 


Assume a fixed initial state (S(0), V(0)) = (S(0), V(0)) and |p| 4 1. 
Consider the estimation of delta, a derivative with respect to S(O). 
Through the argument in Example 7.3.2, it suffices to consider the dependence 
of ($(1), V(1)) on S(0). Because S(0) is not a parameter of the distribution of 
V (1), it actually suffices to consider Ŝ(1). Observe that S(0) appears in both 
the mean and variance of S(1), so we need to combine contributions from 
(7.38) and (7.40). After some algebraic simplification, the score becomes 


A(lj(l+rh) Z2(1)-1 
S(0).\/V (0)h S(O)? 


with Z,(1) the standard normal random variable used to generate $(1) in the 


Euler scheme. 
Next consider estimation of sensitivity to the parameter $\sigma$. Because $\sigma$ is a parameter of the transition law of the Euler scheme, every transition contributes a term to the score. For a generic transition out of a state $(\hat S, \hat V)$, we use (7.40) with

$$
A = \begin{pmatrix} \hat S\sqrt{\hat V h} & 0 \\ \rho\sigma\sqrt{\hat V h} & \sigma\sqrt{\hat V h(1-\rho^2)} \end{pmatrix},
\qquad
\dot\Sigma = \begin{pmatrix} 0 & \rho\hat S\hat V h \\ \rho\hat S\hat V h & 2\sigma\hat V h \end{pmatrix}.
$$

For the $i$th transition, (7.40) then simplifies to

$$
\frac{Z_2^2(i) - 1}{\sigma} + \frac{\rho Z_1(i)Z_2(i)}{\sigma\sqrt{1-\rho^2}}. \tag{7.48}
$$

The score for a function of $(\hat S(i), \hat V(i))$, $i = 1, \dots, m$, is then the sum of (7.48) for $i = 1, \dots, m$.
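The two scores derived above can be accumulated alongside an Euler path. The following is a minimal sketch, not a definitive implementation: the function name `heston_euler_scores`, the parameter values used to exercise it, and the flooring of the variance at zero inside the square root are assumptions added for illustration (the score formulas presuppose a strictly positive variance at every step).

```python
import math
import random


def heston_euler_scores(s0, v0, r, kappa, theta, sigma, rho, h, m, rng):
    """Simulate one m-step Euler path of the Heston model.

    Returns (S_m, delta_score, sigma_score), where delta_score is the score
    with respect to S(0) and sigma_score sums the per-transition scores with
    respect to sigma, as in the text.
    """
    s, v = s0, v0
    delta_score = 0.0
    sigma_score = 0.0
    root = math.sqrt(1.0 - rho * rho)
    for i in range(m):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        if i == 0:
            # Only the first transition depends on S(0): mean and variance terms.
            delta_score = (z1 * (1.0 + r * h) / (s0 * math.sqrt(v0 * h))
                           + (z1 * z1 - 1.0) / s0)
        # Every transition depends on sigma.
        sigma_score += (z2 * z2 - 1.0) / sigma + rho * z1 * z2 / (sigma * root)
        # Assumption: floor V at zero so the square root is defined.
        vol = math.sqrt(max(v, 0.0) * h)
        s, v = ((1.0 + r * h) * s + s * vol * z1,
                v + kappa * (theta - v) * h + sigma * vol * (rho * z1 + root * z2))
    return s, delta_score, sigma_score
```

An LRM estimate of a sensitivity is then the sample average of the discounted payoff multiplied by the corresponding score. A quick sanity check is that each score averages to approximately zero over many paths.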


LIBOR Market Model 


The application of the likelihood ratio method to the Heston stochastic volatil- 
ity model relies on the covariance matrix in (7.47) having full rank. But in an 
Euler scheme in which the dimension of the state vector exceeds the dimension 
of the driving Brownian motion, the covariance matrix of the transition law 
is singular. As explained in Section 7.3.2, this prevents a direct application of 
the likelihood ratio method. In LIBOR market models, the number of forward
rates in the state vector typically does exceed the dimension of the driving 
Brownian motion (the number of factors), making this a practical as well as 
a theoretical complication. 

One way to address this issue notes that even if the one-step transition law is singular, the $k$-step transition law may have a density for some $k > 1$. But a first attempt to use this observation raises another issue: in an Euler scheme for log forward rates $(\log \hat L_1, \dots, \log \hat L_M)$, as in (3.120), the one-step transition law is normal but the $k$-step transition law is not, for $k > 1$, because of the dependence of the drift term on the current level of rates. To get around this problem, Glasserman and Zhao [150] make the approximation in (7.27) under which the drift becomes a function of the initial rates rather than the current rates. Using this approximation and a fixed time step $h$, the $k$-step evolution of each $\log \hat L_n$ takes the form

$$
\log \hat L_n(k) = \log \hat L_n(0) + h\sum_{i=0}^{k-1}\left[\mu_n^o(ih) - \tfrac{1}{2}\|\sigma_n(ih)\|^2\right]
+ \sqrt{h}\,\left[\sigma_n(0)^\top, \sigma_n(h)^\top, \dots, \sigma_n((k-1)h)^\top\right]
\begin{pmatrix} Z_1 \\ \vdots \\ Z_k \end{pmatrix},
$$

where the row vectors $\sigma_n(ih)^\top$ have been concatenated into a single vector of length $kd$, the column vectors $Z_i$ have been stacked into a column vector of the same length, and $d$ is the number of factors. For sufficiently large $k^*$, the $M \times k^*d$ matrix

$$
A(k^*) = \begin{pmatrix}
\sigma_1(0)^\top & \sigma_1(h)^\top & \cdots & \sigma_1((k^*-1)h)^\top \\
\sigma_2(0)^\top & \sigma_2(h)^\top & \cdots & \sigma_2((k^*-1)h)^\top \\
\vdots & \vdots & & \vdots \\
\sigma_M(0)^\top & \sigma_M(h)^\top & \cdots & \sigma_M((k^*-1)h)^\top
\end{pmatrix}
$$

may have rank $M$ even if the number of factors $d$ is smaller than the number of rates $M$. Whether this occurs depends on how the volatilities $\sigma_n(ih)$ change with $i$. If it does occur, then the covariance matrix $A(k^*)A(k^*)^\top$ of the (approximate) $k^*$-step transition law is invertible and we may use formulas in Example 7.3.4 for an arbitrary Gaussian vector.

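For example, to apply this idea to estimate a sensitivity with respect to some $\hat L_j(0)$, we set

$$
\begin{aligned}
X &= (\log \hat L_1(k^*), \dots, \log \hat L_M(k^*))^\top, \\
a_n &= \log \hat L_n(0) + h\sum_{\ell=0}^{k^*-1}\left[\mu_n^o(\ell h) - \tfrac{1}{2}\|\sigma_n(\ell h)\|^2\right], \quad n = 1, \dots, M, \\
\dot a_n &= \frac{\partial a_n}{\partial \hat L_j(0)}, \quad n = 1, \dots, M,
\end{aligned}
$$

and $\Sigma = hA(k^*)A(k^*)^\top$. The (approximate) LRM estimator for the sensitivity of an arbitrary discounted payoff $f(\hat L_1(i), \dots, \hat L_M(i))$, $i \ge k^*$, is then

$$
f(\hat L_1(i), \dots, \hat L_M(i))\,(X - a)^\top \Sigma^{-1}\dot a,
$$

with $a = (a_1, \dots, a_M)^\top$ and $\dot a = (\dot a_1, \dots, \dot a_M)^\top$. Glasserman and Zhao [150]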
test this method on payoffs with discontinuities (digital caplets and knock-out 
caplets) and find that it substantially outperforms finite-difference estimation. 
They compare methods based on the root mean square error achieved in a 
fixed amount of computing time. 


Mixed Estimators 


Pathwise differentiation and LRM estimation can be combined to take ad- 
vantage of the strengths of each approach. We illustrated one type of combi- 
nation in discussing gamma estimation in Section 7.3.3. There we used each 
method for one order of differentiation. An alternative type of combination 
uses the likelihood ratio method near a discontinuity and pathwise differenti- 
ation everywhere else. 

We illustrate this idea with an example from Fournié et al. [124]. The 
payoff of a digital option struck at K can be written as 


$$
\begin{aligned}
\mathbf{1}\{x > K\} &= f_\epsilon(x) + \bigl(\mathbf{1}\{x > K\} - f_\epsilon(x)\bigr) \\
&= f_\epsilon(x) + h_\epsilon(x),
\end{aligned}
$$

with

$$
f_\epsilon(x) = \min\{1, \max\{0, x - K + \epsilon\}/2\epsilon\}.
$$

The function $f_\epsilon$ makes a piecewise linear approximation to the step-function payoff of the digital option and $h_\epsilon$ corrects the approximation. We can apply pathwise differentiation to $f_\epsilon(S(T))$ and the likelihood ratio method to $h_\epsilon(S(T))$ to get the combined delta estimator

$$
\frac{1}{2\epsilon}\mathbf{1}\{|S(T) - K| < \epsilon\}\frac{S(T)}{S(0)} + h_\epsilon(S(T))\frac{\zeta(S(T))}{S(0)\sigma\sqrt{T}},
$$

assuming $S(t)$ is a geometric Brownian motion and using the notation of Example 7.3.1.
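The combined estimator can be sketched in a few lines. This is an illustrative sketch under stated assumptions rather than the authors' code: the function name `mixed_digital_delta` is hypothetical, the payoff is left undiscounted as in the displayed formulas, and the sketch relies on the fact that, when $S(T)$ is sampled as $S(0)\exp((r - \sigma^2/2)T + \sigma\sqrt{T}Z)$, the quantity $\zeta(S(T))$ of Example 7.3.1 coincides with the normal draw $Z$.

```python
import math
import random


def mixed_digital_delta(s0, k, r, sigma, t, eps, n, rng):
    """Return (sample mean, sample variance) of the mixed pathwise/LRM delta
    estimator for the undiscounted digital payoff 1{S(T) > K} under GBM."""
    if eps <= 0.0:
        raise ValueError("eps must be positive; eps -> 0 is the pure LRM case")
    root_t = math.sqrt(t)
    total = 0.0
    total_sq = 0.0
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)
        st = s0 * math.exp((r - 0.5 * sigma ** 2) * t + sigma * root_t * z)
        # LRM score zeta(S(T)) / (S(0) sigma sqrt(T)), with zeta(S(T)) = z here.
        score = z / (s0 * sigma * root_t)
        f_eps = min(1.0, max(0.0, st - k + eps) / (2.0 * eps))
        h_eps = (1.0 if st > k else 0.0) - f_eps
        # Pathwise derivative of f_eps(S(T)) with respect to S(0).
        pathwise = (st / s0) / (2.0 * eps) if abs(st - k) < eps else 0.0
        est = pathwise + h_eps * score
        total += est
        total_sq += est * est
    mean = total / n
    return mean, total_sq / n - mean * mean
```

Calling the function over a grid of `eps` values with the same parameters gives a rough picture of how the variance depends on the linearization parameter.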

Figure 7.4 plots the variance of this estimator as a function of $\epsilon$ with parameters $S(0) = K = 100$, $\sigma = 0.30$, $r = 0.05$, and $T = 0.25$. The case $\epsilon = 0$ corresponds to using only the likelihood ratio method. A small $\epsilon > 0$ increases variance, because of the $\epsilon$ in the denominator of the estimator, but larger values of $\epsilon$ can substantially reduce variance. The optimum occurs at a surprisingly large value of $\epsilon \approx 35$.


Fig. 7.4. Variance of mixed estimator as a function of linearization parameter $\epsilon$.


Rewriting the combined estimator as

$$
\mathbf{1}\{S(T) > K\}\frac{\zeta(S(T))}{S(0)\sigma\sqrt{T}}
+ \left( \frac{1}{2\epsilon}\mathbf{1}\{|S(T) - K| < \epsilon\}\frac{S(T)}{S(0)} - f_\epsilon(S(T))\frac{\zeta(S(T))}{S(0)\sigma\sqrt{T}} \right)
$$

reveals an interpretation as a control variate estimator: the first term is the LRM estimator and the expression in parentheses has expected value zero. The implicit coefficient on the control is 1 and further variance reduction could be achieved by optimizing the coefficient.


Sensitivities to Calibration Instruments


Avellaneda and Gamba [28] consider the problem of estimating sensitivities of
derivative securities with respect to the prices of securities used to calibrate 
a model, which may or may not be the underlying assets in the usual sense. 
These sensitivities are relevant if the securities used to calibrate a model to 
market prices are also the securities used for hedging, as is often the case. 

The formulation in Avellaneda and Gamba [28] is based on weighted Monte 
Carlo (as in Section 4.5.2), but similar ideas apply in a simpler control variate 
setting. Denote by Y the discounted payoff of a derivative security and by X 
the discounted payoff of a hedging instrument. Suppose we estimate the price 
of the derivative security using an estimator of the form 


$$
\hat Y = Y - \beta(X - c), \tag{7.49}
$$

with $\beta$ interpreted as the slope of a regression line. The constant $c$ is the observed market price of the hedging instrument, which may or may not equal $E[X]$. The difference $E[X] - c$ represents model error; if this is nonzero, then (7.49) serves primarily to correct for model error rather than to reduce variance. The coefficient $\beta$ is the sensitivity of the corrected price to the market price of the hedging instrument and thus admits an interpretation as a hedge ratio. This coefficient can be estimated by applying, e.g., ordinary least squares regression to simulated values of $(X, Y)$.

A link with regression can also be seen in the application of the likelihood ratio method with the normal distribution. Suppose $X \sim N(\theta, \sigma^2)$ and suppose $Y$ depends on $\theta$ through $X$ (for example, $Y$ could be a function of $X$). As in (7.38), the score for estimating sensitivity to $\theta$ is $(X - \theta)/\sigma^2$ and the corresponding likelihood ratio method estimator has expectation

$$
E\left[ Y\,\frac{X - \theta}{\sigma^2} \right] = \frac{\mathrm{Cov}[X, Y]}{\mathrm{Var}[X]}.
$$

This is precisely the slope in a regression of Y against X. From this perspec- 
tive, using regression to estimate a sensitivity can be interpreted as using an 
approximate LRM estimator based on the normal distribution. 
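The link between the regression slope and the normal-distribution LRM estimator can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, and the test case ($Y = X^2$, for which the sensitivity of $E[Y]$ to $\theta$ is $2\theta$) is an assumption chosen because the answer is known in closed form. The last function applies a correction of the form (7.49) using sample means and the fitted slope.

```python
import random


def regression_slope(xs, ys):
    """Ordinary least squares slope of ys regressed on xs."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


def lrm_normal_sensitivity(xs, ys, theta, sigma):
    """LRM estimate of dE[Y]/dtheta when X ~ N(theta, sigma^2)."""
    return sum(y * (x - theta) / sigma ** 2 for x, y in zip(xs, ys)) / len(xs)


def corrected_price(xs, ys, c):
    """Price estimate corrected toward the market price c of the hedging
    instrument; returns (corrected estimate, fitted slope beta)."""
    beta = regression_slope(xs, ys)
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    return my - beta * (mx - c), beta
```

With simulated pairs `(xs, ys)`, `regression_slope` and `lrm_normal_sensitivity` should agree up to sampling error when `xs` is drawn from the stated normal distribution.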


7.4 Concluding Remarks 


Extensive numerical evidence accumulated across many models and applica- 
tions indicates that the pathwise method, when applicable, provides the best 
estimates of sensitivities. Compared with finite-difference methods, pathwise 
estimates require less computing time and they directly estimate derivatives 
rather than differences. Compared with the likelihood ratio method, pathwise 
estimates usually have smaller variance, often much smaller.

The application of the pathwise method requires interchanging the order of differentiation and integration. Sufficient conditions for this interchange, tailored to option pricing applications, are provided by conditions (A1)-(A4) in
Section 7.2.2. A simple rule of thumb states that the pathwise method yields 
an unbiased estimate of the derivative of an option price if the option’s dis- 
counted payoff is almost surely continuous in the parameter of differentiation. 
The discounted payoff is a stochastic quantity, so this rule of thumb requires 
continuity as the parameter varies with all random elements not depending 
on the parameter held fixed. (This should not be confused with continuity of 
the expected discounted payoff, which nearly always holds.) This rule excludes 
digital and barrier options, for example. 

Finite-difference approximations are easy to implement and would appear 
to require less careful examination of the dependence of a model on its pa- 
rameters. But when the pathwise method is inapplicable, finite-difference ap- 
proximations have large mean square errors: the lack of continuity that may 
preclude the use of the pathwise method produces a large variance in a finite- 
difference estimate that uses a small parameter increment; a large increment 
leads to a large bias. The size of an effective parameter increment — one min- 
imizing the mean square error — is very sensitive to the pathwise continuity 
of the discounted payoff. The question of continuity thus cannot be avoided 
by using finite-difference estimates. 

In contrast, the likelihood ratio method does not require any smoothness 
in the discounted payoff because it is based on differentiation of probabil- 
ity densities instead. This makes LRM potentially attractive in exactly the 
settings in which the pathwise method fails. The application of LRM is, how- 
ever, limited by two features: it requires explicit knowledge of the relevant 
probability densities, and its estimates often have large variance. 

The variance of LRM estimates becomes problematic when the para- 
meter of differentiation influences many of the random elements used to 
simulate a path. For example, consider the simulation of a discrete path $S(0), S(t_1), S(t_2), \dots$ of geometric Brownian motion, and contrast the estimation of delta and vega. Only the distribution of $S(t_1)$ depends directly on $S(0)$; given $S(t_1)$, the subsequent $S(t_i)$ are independent of $S(0)$. But every transition depends on the volatility parameter $\sigma$. So, the score used to estimate delta has just a single term, whereas the score used to estimate vega has as many terms as there are transitions. Summing a large number of terms in the score produces an LRM estimator with a large variance.
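The growth of the vega score with the number of transitions is easy to see numerically. The sketch below is illustrative, with hypothetical function names, and assumes equally spaced dates with a fixed spacing `dt`, so that adding transitions lengthens the path rather than refining it. Each log-increment is $N((r - \sigma^2/2)\,dt, \sigma^2 dt)$, which gives a per-transition vega score of $(Z^2 - 1)/\sigma - Z\sqrt{dt}$ with variance $dt + 2/\sigma^2$; the rate $r$ drops out of both scores.

```python
import math
import random


def gbm_scores(s0, sigma, dt, m, rng):
    """Return (delta_score, vega_score) for one m-step GBM path."""
    delta_score = 0.0
    vega_score = 0.0
    for i in range(m):
        z = rng.gauss(0.0, 1.0)
        if i == 0:
            # Only the first transition depends on S(0).
            delta_score = z / (s0 * sigma * math.sqrt(dt))
        # Every transition contributes a term to the vega score.
        vega_score += (z * z - 1.0) / sigma - z * math.sqrt(dt)
    return delta_score, vega_score


def score_variances(s0, sigma, dt, m, n, rng):
    """Sample variances of the delta and vega scores over n paths."""
    ds, vs = [], []
    for _ in range(n):
        d, v = gbm_scores(s0, sigma, dt, m, rng)
        ds.append(d)
        vs.append(v)

    def var(a):
        mean = sum(a) / len(a)
        return sum((x - mean) ** 2 for x in a) / (len(a) - 1)

    return var(ds), var(vs)
```

Under these assumptions the delta-score variance is unaffected by `m`, while the vega-score variance grows in proportion to `m`.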

Several authors (including Benhamou [44], Cvitanić et al. [94], Fournié et 
al. [124], and Gobet and Munos [161]) have proposed extensions of the like- 
lihood ratio method using ideas from Malliavin calculus. These techniques 
reduce to LRM when the relevant densities are available; otherwise, they re- 
place the score with a Skorohod integral, which is then computed numerically 
in the simulation itself. This integral may be viewed as a randomization of 
the score in the sense that the score is its conditional expectation. Further 
identities in this spirit are derived in Gobet and Munos [161], some leading to 




variance reduction. Except in special cases, evaluation of the Skorohod integral 
is computationally demanding. Estimators that require it should therefore be 
compared with the simpler alternative of applying LRM to an Euler approximation, for which the (Gaussian) transition laws are explicitly available.

Estimating second derivatives is fundamentally more difficult than estimating first derivatives, regardless of the method used. The pathwise method is generally inapplicable to second derivatives of option prices: a kink in an option payoff becomes a discontinuity in the derivative of the payoff. Combinations of the pathwise method and likelihood ratio method generally produce better gamma estimates than LRM alone. The two methods can also be combined to estimate first derivatives, with one method in effect serving as a control variate for the other.


8 


Pricing American Options 


Whereas a European option can be exercised only at a fixed date, an Amer- 
ican option can be exercised any time up to its expiration. The value of an 
American option is the value achieved by exercising optimally. Finding this 
value entails finding the optimal exercise rule — by solving an optimal stop- 
ping problem — and computing the expected discounted payoff of the option 
under this rule. The embedded optimization problem makes this a difficult 
problem for simulation. 

This chapter presents several methods that address this problem and dis- 
cusses their strengths and weaknesses. The methods differ in the restrictions 
they impose on the number of exercise opportunities or the dimension of the 
underlying state vector, the information they require about the underlying 
processes, and the extent to which they seek to compute the exact price or 
just a reasonable approximation. Any general method for pricing American 
options by simulation requires substantial computational effort, and that is 
certainly the case for the methods we discuss here. 

A theme of this chapter is the importance of understanding the sources 
of high and low bias that affect all methods for pricing American options by 
simulation. High bias results from using information about the future in mak- 
ing the decision to exercise, and this in turn results from applying backward 
induction to simulated paths. Low bias results from following a suboptimal ex- 
ercise rule. Some methods mix the two sources of bias, but we will see that by 
separating them it is often possible to produce a pair of estimates straddling 
the optimal value. 


8.1 Problem Formulation 


A general class of continuous-time American option pricing problems can be formulated by specifying a process $U(t)$, $0 \le t \le T$, representing the discounted payoff from exercise at time $t$, and a class of admissible stopping times $\mathcal{T}$ with values in $[0, T]$. The problem, then, is to find the optimal expected discounted payoff

$$
\sup_{\tau \in \mathcal{T}} E[U(\tau)].
$$

An arbitrage argument justifies calling this the option price under appropriate regularity conditions; see Duffie [98].

The process $U$ is commonly derived from more primitive elements. With little loss of generality, we restrict attention to problems that can be formulated through an $\mathbb{R}^d$-valued Markov process $\{X(t), 0 \le t \le T\}$ recording all necessary information about relevant financial variables, such as the current prices of the underlying assets. The Markov property can often be achieved by augmenting the state vector to include supplementary variables (such as stochastic volatility), if necessary. The payoff to the option holder from exercise at time $t$ is $h(X(t))$ for some nonnegative payoff function $h$. If we further suppose the existence of an instantaneous short rate process $\{r(t), 0 \le t \le T\}$, the pricing problem becomes calculation of

$$
\sup_{\tau \in \mathcal{T}} E\left[ e^{-\int_0^\tau r(u)\,du}\, h(X(\tau)) \right]. \tag{8.1}
$$

It is implicit in the form of the discounting in this expression that the expectation is taken with respect to the risk-neutral measure. By taking the discount factor to be a component of $X$, we could absorb the discounting into the function $h$. In this Markovian setting, it is natural to take the admissible stopping rules $\mathcal{T}$ to be functions of the current state, augmenting the state vector if necessary. This means that the exercise decision at time $t$ is determined by $X(t)$.

This formulation includes the classical American put as a special case. Consider a put struck at $K$ on a single underlying asset $S(t)$. The risk-neutral dynamics of $S$ are modeled as geometric Brownian motion GBM$(r, \sigma^2)$, with $r$ a constant risk-free interest rate. Suppose the option expires at $T$. Its value at time 0 is then

$$
\sup_{\tau \in \mathcal{T}} E[e^{-r\tau}(K - S(\tau))^+], \tag{8.2}
$$

where the elements of $\mathcal{T}$ are stopping times (with respect to $S$) taking values in $[0, T]$. This supremum is achieved by an optimal stopping time $\tau^*$ that has the form

$$
\tau^* = \inf\{t \ge 0 : S(t) \le b^*(t)\}, \tag{8.3}
$$

for some optimal exercise boundary $b^*$. This is illustrated in Figure 8.1.

We have written the payoff in (8.2) as $(K - S(\tau))^+$ rather than $(K - S(\tau))$ so that exercising an out-of-the-money option produces a zero payoff rather than a negative payoff. This allows us to include the possibility that the option expires worthless within the event $\{\tau = T\}$ rather than writing, e.g., $\tau = \infty$ for this case. For this reason, in (8.1) and throughout this chapter we take the payoff function $h$ to be nonnegative.

Fig. 8.1. Exercise boundary for American put with payoff $(K - S(t))^+$. The option is exercised at $\tau$, the first time the underlying asset reaches the boundary.


In discussing simulation methods for pricing American options, we restrict ourselves to options that can be exercised only at a fixed set of exercise opportunities $t_1 < t_2 < \cdots < t_m$. In many cases, such a restriction is part of the
option contract, and such options are often called “Bermudan” because they 
lie between European and continuously exercisable American options. In some 
cases, it is known in advance that exercise is suboptimal at all but a finite 
set of dates, such as dividend payment dates. One may alternatively view the 
restriction to a finite set of exercise dates as an approximation to a contract 
allowing continuous exercise, in which case one would want to consider the 
effect of letting m increase to infinity. Because even the finite-exercise problem 
poses a significant challenge to Monte Carlo, we focus exclusively on this case. 
We construe “American” to include both types of options. 

To reduce notation, throughout this chapter we write $X(t_i)$ as $X_i$; this is the state of the underlying Markov process at the $i$th exercise opportunity. The discrete-time process $X_0 = X(0), X_1, \dots, X_m$ is then a Markov chain on $\mathbb{R}^d$. We discuss American option pricing based on simulation of this Markov chain. In practice, simulating $X_{i+1}$ from $X_i$ may entail simulating values of $X(t)$ with $t_i < t < t_{i+1}$. For example, it may be necessary to apply a time discretization using a time step smaller than the intervals $t_{i+1} - t_i$ between exercise dates. In this case, the simulated values of $X_1, \dots, X_m$ are subject to discretization error. For the purposes of discussing American option pricing, we disregard this issue and focus on the challenge of solving the optimal stopping problem under the assumption that $X_1, \dots, X_m$ can be simulated exactly.


Dynamic Programming Formulation


Working with a finite-exercise Markovian formulation lends itself to a characterization of the option value through dynamic programming. Let $h_i$ denote the payoff function for exercise at $t_i$, which we now allow to depend on $i$. Let $V_i(x)$ denote the value of the option at $t_i$ given $X_i = x$, assuming the option has not previously been exercised. This can also be interpreted as the value of a newly issued option at $t_i$ starting from state $x$. We are ultimately interested in $V_0(X_0)$. This value is determined recursively as follows:

$$
V_m(x) = h_m(x) \tag{8.4}
$$

$$
V_{i-1}(x) = \max\{h_{i-1}(x), E[D_{i-1,i}(X_i)V_i(X_i) \mid X_{i-1} = x]\}, \quad i = 1, \dots, m. \tag{8.5}
$$

Here we have introduced the notation $D_{i-1,i}(X_i)$ for the discount factor from $t_{i-1}$ to $t_i$. Equation (8.4) states that the option value at expiration is given by the payoff function $h_m$; equation (8.5) states that at the $(i-1)$th exercise date the option value is the maximum of the immediate exercise value and the expected present value of continuing. (We usually exclude the current time 0 from the set of exercise opportunities; this can be accommodated in (8.5) by setting $h_0 \equiv 0$.)

Most methods for computing American option prices rely on the dynamic 
programming representation (8.4)-(8.5). This is certainly true of the binomial 
method in which the conditional expectation in (8.5) reduces to the average 
over the two successor nodes in the lattice. (See Figure 4.7 and the surrounding 
discussion.) Estimating these conditional expectations is the main difficulty 
in pricing American options by simulation. The difficulty is present even in 
the two-stage problem discussed in Example 1.1.3, and is compounded by the 
nesting of conditional expectations in (8.5) as i decreases from m to 1. 
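The recursion (8.4)-(8.5) is simplest to see on a binomial lattice, where the conditional expectation is a two-point average. The following is a minimal sketch for a put under assumed Cox-Ross-Rubinstein lattice parameters; the function name `lattice_put` and the `exercise_every` argument (which selects the lattice steps treated as exercise opportunities) are hypothetical and introduced only for illustration.

```python
import math


def lattice_put(s0, k, r, sigma, t, steps, exercise_every=None):
    """Value a put on a CRR binomial lattice by backward induction.

    exercise_every=None allows exercise only at expiration (European);
    exercise_every=q allows exercise at lattice steps q, 2q, ... as well.
    Time 0 is excluded from the exercise opportunities.
    """
    dt = t / steps
    u = math.exp(sigma * math.sqrt(dt))
    d = 1.0 / u
    disc = math.exp(-r * dt)                # one-step discount factor
    p = (math.exp(r * dt) - d) / (u - d)    # risk-neutral up probability
    # Terminal condition: option value equals the payoff at expiration.
    values = [max(k - s0 * u ** j * d ** (steps - j), 0.0) for j in range(steps + 1)]
    for i in range(steps - 1, -1, -1):
        can_exercise = (exercise_every is not None and i > 0
                        and i % exercise_every == 0)
        new_values = []
        for j in range(i + 1):
            # Discounted average over the two successor nodes.
            value = disc * (p * values[j + 1] + (1.0 - p) * values[j])
            if can_exercise:
                value = max(k - s0 * u ** j * d ** (i - j), value)
            new_values.append(value)
        values = new_values
    return values[0]
```

For example, `lattice_put(100, 100, 0.05, 0.2, 1.0, 200, exercise_every=20)` treats ten of the lattice steps as exercise dates, and its value lies between the European and the every-step values.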

By augmenting the state vector if necessary, we assume that the discount factor $D_{i-1,i}$ is a deterministic function of $X_i$. The discount factor could have the form

$$
D_{i-1,i}(X_i) = \exp\left( -\int_{t_{i-1}}^{t_i} r(u)\,du \right),
$$

but this is not essential. The more general formulation frees us from reliance on the risk-neutral measure. We simply require that the expectation is taken with respect to the probability measure consistent with the choice of numeraire implicit in the discount factor; see the discussion in Section 1.2.3. If we simulate under the $t_m$-maturity forward measure, the discount factor $D_{i-1,i}$ becomes $B_m(t_{i-1})/B_m(t_i)$, with $B_m(t)$ denoting the time-$t$ price of a bond maturing at $t_m$.

In discussing pricing algorithms in this chapter, we suppress explicit discounting and work with the following simplified dynamic programming recursion:

$$
V_m(x) = h_m(x) \tag{8.6}
$$

$$
V_{i-1}(x) = \max\{h_{i-1}(x), E[V_i(X_i) \mid X_{i-1} = x]\}. \tag{8.7}
$$


This formulation is actually sufficiently general to include (8.4)-(8.5), as we now explain.

In (8.4)-(8.5), each $V_i$ gives the value of the option in time-$t_i$ dollars, but we could alternatively formulate the problem so that the value is always recorded in time-0 dollars. Let $D_{0,i}(X_i)$ denote the discount factor from time 0 to $t_i$, which we suppose is a deterministic function of $X_i$ by augmenting the state vector if necessary. To be consistent with its interpretation as a discount factor, we require that $D_{0,i}(X_i)$ be nonnegative and that it satisfy $D_{0,0} \equiv 1$ and $D_{0,i-1}(X_{i-1})D_{i-1,i}(X_i) = D_{0,i}(X_i)$. Now let

$$
\tilde h_i(x) = D_{0,i}(x)h_i(x), \quad i = 1, \dots, m,
$$

and

$$
\tilde V_i(x) = D_{0,i}(x)V_i(x), \quad i = 0, 1, \dots, m.
$$

Then $\tilde V_0 = V_0$ and the $\tilde V_i$ satisfy

$$
\begin{aligned}
\tilde V_m(x) &= \tilde h_m(x) \\
\tilde V_{i-1}(x) &= D_{0,i-1}(x)V_{i-1}(x) \\
&= D_{0,i-1}(x)\max\{h_{i-1}(x), E[D_{i-1,i}(X_i)V_i(X_i) \mid X_{i-1} = x]\} \\
&= \max\{\tilde h_{i-1}(x), E[D_{0,i-1}(x)D_{i-1,i}(X_i)V_i(X_i) \mid X_{i-1} = x]\} \\
&= \max\{\tilde h_{i-1}(x), E[\tilde V_i(X_i) \mid X_{i-1} = x]\}.
\end{aligned}
$$

Thus, the discounted values $\tilde V_i$ satisfy a dynamic programming recursion of exactly the form in (8.6)-(8.7). For the purpose of introducing simulation estimators, (8.7) is slightly simpler, though in practical implementation one usually uses (8.5).


Stopping Rules 


The dynamic programming recursions (8.4)-(8.5) and (8.6)-(8.7) focus on option values, but it is also convenient to view the pricing problem through stopping rules and exercise regions. Any stopping time $\tau$ (for the Markov chain $X_0, X_1, \dots, X_m$) determines a value (in general suboptimal) through

$$
V_0^{(\tau)}(X_0) = E[h_\tau(X_\tau)]. \tag{8.8}
$$

Conversely, any rule assigning a "value" $\hat V_i(x)$ to each state $x \in \mathbb{R}^d$ and exercise opportunity $i$, with $\hat V_m = h_m$, determines a stopping rule

$$
\hat\tau = \min\{i \in \{1, \dots, m\} : h_i(X_i) \ge \hat V_i(X_i)\}. \tag{8.9}
$$

The exercise region determined by $\hat V_i$ at the $i$th exercise date is the set

$$
\{x : h_i(x) \ge \hat V_i(x)\} \tag{8.10}
$$

and the continuation region is the complement of this set. The stopping rule $\hat\tau$ can thus also be described as the first time the Markov chain $X_i$ enters an exercise region. The value determined by using $\hat\tau$ in (8.8) does not in general coincide with $\hat V_0$, though the two would be equal if we started with the optimal value function.


Continuation Values 


The continuation value of an American option with a finite number of exercise opportunities is the value of holding rather than exercising the option. The continuation value in state $x$ at date $t_i$ (measured in time-0 dollars) is

$$
C_i(x) = E[V_{i+1}(X_{i+1}) \mid X_i = x], \tag{8.11}
$$

$i = 0, \dots, m-1$. These also satisfy a dynamic programming recursion: $C_m \equiv 0$ and

$$
C_i(x) = E[\max\{h_{i+1}(X_{i+1}), C_{i+1}(X_{i+1})\} \mid X_i = x], \tag{8.12}
$$

for $i = 0, \dots, m-1$. The option value is $C_0(X_0)$, the continuation value at time 0.

The value functions $V_i$ determine the continuation values through (8.11). Conversely,

$$
V_i(x) = \max\{h_i(x), C_i(x)\},
$$

$i = 1, \dots, m$, so the $C_i$ determine the $V_i$. An approximation $\hat C_i$ to the continuation values determines a stopping rule through

$$
\hat\tau = \min\{i \in \{1, \dots, m\} : h_i(X_i) \ge \hat C_i(X_i)\}. \tag{8.13}
$$

This is the same as the stopping rule in (8.9) if the approximations satisfy $\hat V_i(x) = \max\{h_i(x), \hat C_i(x)\}$.
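The stopping rule (8.13) translates directly into a loop over exercise dates. The sketch below is illustrative: the function name `stopping_index` and the calling conventions `payoff(i, x)` and `cont_hat(i, x)` are assumptions, and the rule stops at date $m$ if it has not stopped earlier, which is what (8.13) gives when the payoff is nonnegative and the approximate continuation value at date $m$ is zero.

```python
def stopping_index(path, payoff, cont_hat):
    """Return the first i in {1, ..., m} with h_i(X_i) >= C_hat_i(X_i).

    path     : states [X_0, X_1, ..., X_m] along one simulated path
    payoff   : callable (i, x) -> h_i(x), assumed nonnegative
    cont_hat : callable (i, x) -> approximate continuation value at date i
    """
    m = len(path) - 1
    for i in range(1, m):
        if payoff(i, path[i]) >= cont_hat(i, path[i]):
            return i
    # At the final date the continuation value is zero, so the rule stops.
    return m
```

For a put struck at 100 with a hypothetical flat continuation estimate of 5, the path `[100, 98, 93, 90]` stops at index 2, the first date at which the exercise value (7) reaches the continuation estimate.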


8.2 Parametric Approximations 


Genuine pricing of American options entails solving the dynamic programming problem (8.4)-(8.5) or (8.6)-(8.7). An alternative to trying to find the
optimal value is to find the best value within a parametric class. This reduces 
the optimal stopping problem to a much more tractable finite-dimensional 
optimization problem. The reduction can be accomplished by considering a 
parametric class of exercise regions or a parametric class of stopping rules. 
The two approaches become equivalent through the correspondences in (8.8)— 
(8.10), so the distinction is primarily one of interpretation. 

The motivation for considering parametric exercise regions is particularly evident in the case of an option on a single underlying asset. The curved exercise boundary in Figure 8.1, for example, is well-approximated by a piecewise linear function with three or four segments (or piecewise exponential as in Ju [205]), which could be specified with four or five parameters. The option value is usually not very sensitive to the exact position of the exercise boundary (the value is continuous across the boundary), suggesting that even a rough approximation to the boundary should produce a reasonable approximation to the optimal option value.

The one-dimensional setting is, however, somewhat misleading. In higher- 
dimensional problems (where Monte Carlo becomes relevant), the optimal 
exercise region need not have a simple structure and there may be no evi- 
dent way of parameterizing a class of plausible approximations. Developing 
a parametric heuristic thus requires a deeper understanding of what drives 
early exercise. The financial interpretation of the optimal stopping problems 
that arise in pricing high-dimensional American options sometimes makes this possible.

To formalize this approach, consider a class of stopping rules $\tau_\theta$, $\theta \in \Theta$, with each $\tau_\theta \in \mathcal{T}$ and $\Theta$ a subset of some $\mathbb{R}^M$. Let

$$
V_0^\Theta = \sup_{\theta \in \Theta} E[h_{\tau_\theta}(X_{\tau_\theta})];
$$

the objective of a parametric heuristic is to estimate $V_0^\Theta$, the optimal value within the parametric class. Because the supremum in this definition is taken over a subset of all admissible stopping rules $\mathcal{T}$, we have

$$
V_0^\Theta \le V_0 = \sup_{\tau \in \mathcal{T}} E[h_\tau(X_\tau)], \tag{8.14}
$$

typically with strict inequality. Thus, a consistent estimator of $V_0^\Theta$ underestimates the true option value.
A meta-algorithm for estimation of $V_0^\Theta$ consists of the following steps:

Step 1: Simulate $n_1$ independent replications $X^{(j)}$, $j = 1, \dots, n_1$, of the Markov chain $(X_0, X_1, \dots, X_m)$;

Step 2: Find $\hat\theta$ maximizing

$$
\hat V_0^{\hat\theta} = \frac{1}{n_1}\sum_{j=1}^{n_1} h_{\tau^{(j)}(\hat\theta)}\bigl(X^{(j)}_{\tau^{(j)}(\hat\theta)}\bigr),
$$

where for each $\theta \in \Theta$, $\tau^{(j)}(\theta)$ is the time of exercise on the $j$th replication at parameter $\theta$.


Bias


Assuming Step 2 can be executed, the result is an estimator that is biased high relative to $V_0^\Theta$, in the sense that

$$
E[\hat V_0^{\hat\theta}] \ge V_0^\Theta. \tag{8.15}
$$

This simply states that the expected value of the maximum over $\theta$ is at least as large as the maximum over $\theta$ of the expected values. This can be viewed as a consequence of Jensen's inequality. It also results from the fact that the in-sample optimum $\hat\theta$ implicitly uses information about the future evolution of the simulated replications in determining an exercise decision.

The combination of the in-sample bias in (8.15) and the suboptimality bias in (8.14) produces an unpredictable bias in $\hat V_0^{\hat\theta}$. One might hope that the high bias in (8.15) offsets the low bias in (8.14), but without further examination of specific cases this conjecture lacks support. Given that bias is inevitable in this setting, it is preferable to control the bias and determine its direction. This can be accomplished by adding the following to the meta-algorithm:


Step 3: Fix $\hat\theta$ at the value found in Step 2. Simulate $n_2$ additional independent replications of the Markov chain using stopping rule $\tau_{\hat\theta}$ and compute the estimate

$$
\hat V_0^{\hat\theta} = \frac{1}{n_2}\sum_{j=n_1+1}^{n_1+n_2} h_{\tau^{(j)}(\hat\theta)}\bigl(X^{(j)}_{\tau^{(j)}(\hat\theta)}\bigr).
$$

Because the second set of replications is independent of the set used to determine $\hat\theta$, we now have

$$
E[\hat V_0^{\hat\theta} \mid \hat\theta] = V_0^{\hat\theta},
$$

which is the true value at parameter $\hat\theta$ and cannot exceed $V_0$. Thus, taking the unconditional expectation we get

$$
E[\hat V_0^{\hat\theta}] \le V_0,
$$

from which we conclude that the estimator produced by Step 3 is biased low.

This is an instance of a more general strategy developed in Broadie and 
Glasserman [65, 66] for separating sources of high and low bias, to which we 
return in Sections 8.3 and 8.5. 
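Steps 1-3 can be sketched for a single-asset put with a one-parameter rule. Everything specific in this sketch is an assumption made for illustration and is not taken from the text: the function names, the constant-threshold rule "exercise at the first date before expiration at which the asset price is at or below $\theta$", geometric Brownian motion dynamics, and the use of a plain grid search for Step 2.

```python
import math
import random


def simulate_paths(n, s0, r, sigma, dt, m, rng):
    """Simulate n GBM paths observed at m equally spaced exercise dates."""
    paths = []
    drift = (r - 0.5 * sigma ** 2) * dt
    vol = sigma * math.sqrt(dt)
    for _ in range(n):
        s = s0
        path = [s]
        for _ in range(m):
            s *= math.exp(drift + vol * rng.gauss(0.0, 1.0))
            path.append(s)
        paths.append(path)
    return paths


def average_payoff(paths, theta, k, r, dt):
    """Average discounted put payoff under the threshold rule theta."""
    total = 0.0
    for path in paths:
        m = len(path) - 1
        tau = m  # default: hold to expiration
        for i in range(1, m):
            if path[i] <= theta:
                tau = i
                break
        total += math.exp(-r * dt * tau) * max(k - path[tau], 0.0)
    return total / len(paths)


def parametric_estimates(grid, n1, n2, s0, k, r, sigma, dt, m, rng):
    """Return (theta_hat, in-sample estimate, out-of-sample estimate)."""
    train = simulate_paths(n1, s0, r, sigma, dt, m, rng)            # Step 1
    theta_hat = max(grid, key=lambda th: average_payoff(train, th, k, r, dt))  # Step 2
    in_sample = average_payoff(train, theta_hat, k, r, dt)
    fresh = simulate_paths(n2, s0, r, sigma, dt, m, rng)            # Step 3
    out_of_sample = average_payoff(fresh, theta_hat, k, r, dt)
    return theta_hat, in_sample, out_of_sample
```

The in-sample value carries the high bias of (8.15) relative to the best rule in the class, while the out-of-sample value from fresh paths is biased low relative to the true option value.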

Estimators like $\hat\theta$ defined as solutions to sample optimization problems have been studied extensively in other settings (including, for example, maximum likelihood estimation). There is a large literature establishing the convergence of such estimators to the true optimum and also the convergence of optimal-value estimators like $\hat V_0^{\hat\theta}$ to the true optimum $V_0^\Theta$ over $\Theta$. Some results of this type are presented in Chapter 6 of Rubinstein and Shapiro [313] and Chapter 7 of Serfling [326].


Optimization 


The main difficulty in implementing Steps 1-2, beyond selection of the class of parametric rules, lies in the optimization problem of Step 2. Andersen [12] considers exercise rules defined by a single threshold at each exercise date and thus reduces the optimization problem to a sequence of one-dimensional searches. Fu and Hu [131] estimate derivatives with respect to parameters (through the pathwise method discussed in Section 7.2) and use these to search for optimal parameters. Garcia [134] uses a simplex method as in Press et al. [299] with the exercise region at each time step described by two parameters.

The optimization problem in Step 2 ordinarily decomposes into $m - 1$ subproblems, one for each exercise date except the last. This holds whenever the parameter vector $\theta$ decomposes into $m - 1$ components with the $i$th component parameterizing the exercise region at $t_i$. This decomposition can be used to search for an optimal parameter vector by optimizing sequentially from the $(m-1)$th date to the first.

In more detail, suppose $\theta = (\theta_1, \dots, \theta_{m-1})$ with $\theta_i$ parameterizing the exercise region at $t_i$. Each $\theta_i$ could itself be a vector. Now consider the following inductive procedure, applied to $n$ independent replications of the Markov chain $X_0, X_1, \dots, X_m$:

(2a) find the value $\hat\theta_{m-1}$ maximizing the average discounted payoff of the option over the $n$ paths assuming exercise is possible only at the $(m-1)$th and $m$th dates;

(2b) with $\hat\theta_i, \dots, \hat\theta_{m-1}$ fixed, find the value $\hat\theta_{i-1}$ maximizing the average discounted payoff of the option over the $n$ paths, assuming exercise is possible only at $i-1, i, \dots, m$, and following the exercise policy determined by $\hat\theta_i, \dots, \hat\theta_{m-1}$ at $i, \dots, m-1$.


If each 6; is a scalar, then this procedure reduces to a sequence of one- 
dimensional optimization procedures. With a finite number of paths, each of 
these is typically a nonsmooth optimization problem and is best solved using 
an iterative search rather than a derivative-based method. Andersen [12| uses 
a golden section procedure, as in Section 10.1 of Press [299]. 

There is no guarantee that (2a)-(2b) will produce an optimum for the
original problem in Step 2. In the decomposition, each $\hat\theta_i$ is
optimized over all paths, whereas in the original optimization problem only a
subset of paths would survive until the $i$th date. One way to address this
issue would be to repeat steps (2a)-(2b) as follows: working backward from
dates $m-1$ to 1, update each $\hat\theta_i$ using parameter values from the
previous iteration to determine the exercise decision at dates
$1,\ldots,i-1$, and parameter values from the current iteration to determine
the exercise decision at dates $i+1,\ldots,m-1$. Even this does not, however,
guarantee an optimal solution to the problem in Step 2.
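
The sequential procedure (2a)-(2b) can be sketched in a few lines for scalar
thresholds. The sketch below is illustrative only and rests on assumptions not
in the text: the rule exercises at date $i < m$ when a scalar statistic exceeds
$\theta_i$, the payoff at date $m$ is simply collected, and a plain grid search
stands in for the golden section search mentioned above. The names
`optimize_thresholds`, `stat`, `payoff`, and `grid` are hypothetical.

```python
import numpy as np

def optimize_thresholds(stat, payoff, grid):
    """Sequentially choose thresholds theta_1..theta_{m-1}, last date first.

    stat[p, i-1]   : scalar exercise statistic on path p at date i (i = 1..m)
    payoff[p, i-1] : discounted exercise payoff h_i on path p at date i
    grid           : candidate threshold values (hypothetical search grid)
    Rule assumed here: exercise at date i < m when stat > theta_i;
    at date m the holder simply receives the payoff h_m >= 0.
    """
    stat = np.asarray(stat, dtype=float)
    payoff = np.asarray(payoff, dtype=float)
    n, m = payoff.shape
    thetas = np.empty(m - 1)
    # Payoff each path collects if it has not exercised before the next date.
    later = payoff[:, m - 1].copy()
    for col in range(m - 2, -1, -1):          # dates m-1, ..., 1
        averages = [
            np.where(stat[:, col] > th, payoff[:, col], later).mean()
            for th in grid
        ]
        best = int(np.argmax(averages))       # first maximizer on the grid
        thetas[col] = grid[best]
        # Fix this threshold and update the payoff realized from this date on.
        later = np.where(stat[:, col] > thetas[col], payoff[:, col], later)
    return thetas, later.mean()

# Tiny hypothetical data set: 4 paths, 2 exercise dates, statistic = payoff.
data = [[3, 0], [1, 5], [4, 4], [0, 2]]
thetas, value = optimize_thresholds(data, data, [0, 1, 2, 3, 4, 5])
```

Because the thresholds are tuned and evaluated on the same paths, the
returned average is an in-sample value; a price estimate in the spirit of
Step 3 would apply the fitted thresholds to an independent set of paths.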

Andersen [12] uses an approach of the type in Steps 1-3 to value Bermudan 
swaptions in a LIBOR market model of the type discussed in Section 3.7. He 
considers threshold rules in which the option is exercised if some function of 
the state vector is above a threshold. He allows the threshold to vary with the 
exercise date and optimizes the vector of thresholds. In the notation of this 
section, each $\theta_i$, $i = 1,\ldots,m-1$, is scalar and denotes the
threshold for the $i$th exercise opportunity.

The simplest rule Andersen [12] considers exercises the swaption if the 
value of the underlying swap is above a time-varying threshold. All the thresh- 
old rules he tests try to capture the idea that exercise should occur when the 
underlying swap has sufficient value. His numerical results indicate that sim- 
ple rules work well and that Bermudan swaption values are not very sensitive 
to the location of the exercise boundary in the parameterizations he uses. 

This example illustrates a feature of many American option pricing prob- 
lems. Although the problem is high-dimensional, it has a lot of structure, 
making approximate solution feasible using relatively simple exercise rules 
that tap into a financial understanding of what drives the exercise decision. 


Parametric Value Functions 


An alternative to specifying an approximate stopping rule or exercise regions 
uses a parametric approximation to the optimal value function. Although the 
two perspectives are ultimately equivalent, the interpretation and implemen- 
tation are sufficiently different to merit separate consideration. 

We work with the optimal continuation values $C_i(x)$ in (8.11) rather than
the value function itself. Consider approximating each function $C_i(x)$ by a
member $C_i(x,\theta_i)$ of a parametric family of functions. For example, we
might take $\theta_i$ to be a vector with elements
$\theta_{i1},\ldots,\theta_{iM}$ and consider functions of the form

$$C_i(x,\theta_i) = \sum_{j=1}^{M} \theta_{ij}\,\psi_j(x),$$

for some set of basis functions $\psi_1,\ldots,\psi_M$. Our objective is to
choose the parameters $\theta_i$ to approximate the recursion (8.12).

Proceeding backward by induction, this entails choosing $\hat\theta_{m-1}$ so
that $C_{m-1}(x,\hat\theta_{m-1})$ approximates
$E[h_m(X_m) \mid X_{m-1} = x]$, with the conditional expectation estimated
from simulated paths of the Markov chain. Given values of
$\hat\theta_{i+1},\ldots,\hat\theta_{m-1}$, we choose $\hat\theta_i$ so that
$C_i(x,\hat\theta_i)$ approximates

$$E\left[\max\{h_{i+1}(X_{i+1}),\, C_{i+1}(X_{i+1},\hat\theta_{i+1})\} \mid X_i = x\right],$$

again using simulated paths to estimate the conditional expectation.
Applying this type of approach in practice involves several issues. Choos- 
ing a parametric family of approximating functions is a problem-dependent 
modeling issue; finding the optimal parameters and, especially, estimating the 
conditional expectations present computational challenges. We return to this
approach in Section 8.6.
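
As a rough illustration of the backward fitting just described, the sketch
below chooses each $\hat\theta_i$ by ordinary least squares over simulated
paths. Least squares is an assumption made here for concreteness; the text at
this point does not prescribe a fitting criterion. The function name
`fit_continuation_parameters` and its argument layout are hypothetical.

```python
import numpy as np

def fit_continuation_parameters(paths, payoff, basis):
    """Fit theta_i for C_i(x, theta_i) = sum_j theta_ij * psi_j(x), i = m-1..1.

    paths  : array (n, m+1), column i holds X_i on each simulated path
    payoff : function (i, x) -> discounted payoff h_i(x), vectorized in x
    basis  : list of functions psi_j, each vectorized in x
    Least squares is an assumed fitting criterion, not taken from the text.
    """
    paths = np.asarray(paths, dtype=float)
    m = paths.shape[1] - 1

    def design(x):
        # Matrix whose columns are the basis functions evaluated at x.
        return np.column_stack([psi(x) for psi in basis])

    thetas = {}
    target = payoff(m, paths[:, m])           # h_m(X_m) on each path
    for i in range(m - 1, 0, -1):
        A = design(paths[:, i])
        thetas[i] = np.linalg.lstsq(A, target, rcond=None)[0]
        # Regression target for date i-1: max{h_i(X_i), C_i(X_i, theta_i)}.
        target = np.maximum(payoff(i, paths[:, i]), A @ thetas[i])
    return thetas

# Hypothetical check: X_2 = 2 X_1 and h(x) = x, so C_1(x) = 2x exactly.
x1 = np.array([1.0, 2.0, 3.0, 4.0])
paths = np.column_stack([np.ones(4), x1, 2.0 * x1])
basis = [lambda x: np.ones_like(x), lambda x: x]
thetas = fit_continuation_parameters(paths, lambda i, x: x, basis)
```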


8.3 Random Tree Methods 


Whereas the approximations of the previous section search for the best so- 
lution within a parametric family, the random tree method of Broadie and 
Glasserman [65] seeks to solve the full optimal stopping problem and estimate 
the genuine value of an American option. And whereas parametric approxi- 
mations rely on insight into the form of a good stopping rule, the method of 
this section assumes little more than the ability to simulate paths of the un- 
derlying Markov chain. With only minimal conditions, the method produces 
two consistent estimators, one biased high and one biased low, and both con- 
verging to the true value. This combination makes it possible to measure and 
control error as the computational effort increases. 

The main drawback of the random tree method is that its computational 
requirements grow exponentially in the number of exercise dates m, so the 
method is applicable only when m is small — not more than about 5, say. 
This substantially limits the scope of the method. Nevertheless, for problems 
with small m it is very effective, and it also serves to illustrate a theme of this 
chapter — managing sources of high and low bias. 

Before discussing the details of the method, we explain how a combination of
two biased estimators can be nearly as effective as a single unbiased
estimator. Suppose, then, that $\hat V_n(b)$ and $\hat v_n(b)$ are each sample
means of $n$ independent replications, for each value of a simulation
parameter $b$. Suppose that as estimators of some $V_0$ they are biased high
and low, respectively, in the sense that

$$E[\hat V_n(b)] \ge V_0 \ge E[\hat v_n(b)]. \qquad (8.16)$$

Suppose that, for some halfwidth $H_n(b)$,

$$\hat V_n(b) \pm H_n(b)$$

is a valid 95% confidence interval for $E[\hat V_n(b)]$ in the sense that the
interval contains this point with 95% probability; and suppose that

$$\hat v_n(b) \pm L_n(b)$$

is similarly a valid 95% confidence interval for $E[\hat v_n(b)]$. Then by
taking the lower confidence limit of the low estimator and the upper
confidence limit of the high estimator, we get an interval

$$\left(\hat v_n(b) - L_n(b),\; \hat V_n(b) + H_n(b)\right) \qquad (8.17)$$

containing the unknown value $V_0$ with probability at least 90% (at least 95%
if $\hat V_n(b)$ and $\hat v_n(b)$ are symmetric about their means). Thus, we
can produce a valid (though potentially very conservative) confidence interval
by combining the two estimators; see Figure 8.2. In our application of this
idea to the random tree method, the inequalities in (8.16) become equalities
as $b \to \infty$ and the interval halfwidths $H_n(b)$ and $L_n(b)$ shrink to
zero as $n \to \infty$. The interval in (8.17) can thus be made to shrink to
the point $V_0$ in the limit as the computational effort grows.

Fig. 8.2. Combining high and low estimators to form a confidence interval.


8.3.1 High Estimator 


As its name suggests, the random tree method is based on simulating a tree of
paths of the underlying Markov chain $X_0, X_1, \ldots, X_m$. Fix a branching
parameter $b \ge 2$. From the initial state $X_0$, simulate $b$ independent
successor states $X_1^1,\ldots,X_1^b$ all having the law of $X_1$. From each
$X_1^{j_1}$, simulate $b$ independent successors
$X_2^{j_1 1},\ldots,X_2^{j_1 b}$ from the conditional law of $X_2$ given
$X_1 = X_1^{j_1}$. From each $X_2^{j_1 j_2}$ generate $b$ successors
$X_3^{j_1 j_2 1},\ldots,X_3^{j_1 j_2 b}$, and so on. Figure 8.3 shows an
example with $m = 2$ and $b = 3$. At the $m$th time step there are $b^m$
nodes, and this is the source of the exponential growth in the computational
cost of the method.

We denote a generic node in the tree at time step $i$ by
$X_i^{j_1 j_2 \cdots j_i}$. The superscript indicates that this node is
reached by following the $j_1$th branch out of $X_0$, the $j_2$th branch out
of the next node, and so on. Although it is not essential that the branching
parameter remain fixed across time steps, this is a convenient simplification
in discussing the method.

It should be noted that the random tree construction differs from what is 
sometimes called a “nonrecombining” tree in that successor nodes are sampled 
randomly. In a more standard nonrecombining tree, the placement of the nodes 
is deterministic, as it is in a binomial lattice; see, for example, Heath et al. 
[173]. 

From the random tree we define high and low estimators at each node by
backward induction. We use the formulation in (8.6)-(8.7). Thus, $h_i$ is the
discounted payoff function at the $i$th exercise date, and the discounted
option value satisfies $V_m \equiv h_m$,

$$V_i(x) = \max\{h_i(x),\, E[V_{i+1}(X_{i+1}) \mid X_i = x]\}, \qquad (8.18)$$

$i = 0,\ldots,m-1$.

Write $\hat V_i^{j_1\cdots j_i}$ for the value of the high estimator at node
$X_i^{j_1\cdots j_i}$. At the terminal nodes we set

$$\hat V_m^{j_1\cdots j_m} = h_m(X_m^{j_1\cdots j_m}). \qquad (8.19)$$

Working backward, we then set

$$\hat V_i^{j_1\cdots j_i} = \max\left\{h_i(X_i^{j_1\cdots j_i}),\; \frac{1}{b}\sum_{j=1}^{b} \hat V_{i+1}^{j_1\cdots j_i j}\right\}. \qquad (8.20)$$

Fig. 8.3. Illustration of random tree and high estimator for a call option with a 
single underlying asset and a strike price of 100. Labels at each node show the level 
of the underlying asset and (in brackets) the value of the high estimator. 


In other words, the high estimator is simply the result of applying ordinary
dynamic programming to the random tree, assigning equal weight to each
branch. Its calculation is illustrated in Figure 8.3 with
$h_i(x) = (x - 100)^+$.
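
The backward recursion (8.19)-(8.20) can be sketched on an explicitly stored
tree. The node levels below are hypothetical (they are not the values shown in
Figure 8.3), the payoff is treated as already discounted, and the names
`high_estimator`, `leaf`, and `call_payoff` are illustrative.

```python
def high_estimator(node, i, payoff):
    """High estimator at a node of an explicitly stored tree.

    node   : (x, children), with children a list of nodes ([] at date m)
    i      : time step of this node
    payoff : function (i, x) -> discounted payoff h_i(x)
    """
    x, children = node
    if not children:                          # terminal node: (8.19)
        return payoff(i, x)
    # Equal-weight average of successor values estimates continuation.
    continuation = sum(
        high_estimator(child, i + 1, payoff) for child in children
    ) / len(children)
    return max(payoff(i, x), continuation)    # backward step: (8.20)


def leaf(x):
    return (x, [])


def call_payoff(i, x):
    # Call with strike 100; discounting is ignored in this toy example.
    return max(x - 100.0, 0.0)


# Hypothetical asset levels for a tree with m = 2 and b = 3.
tree = (100.0, [
    (105.0, [leaf(110.0), leaf(95.0), leaf(120.0)]),
    (90.0, [leaf(85.0), leaf(100.0), leaf(92.0)]),
    (112.0, [leaf(104.0), leaf(99.0), leaf(118.0)]),
])
root_value = high_estimator(tree, 0, call_payoff)
```

In this example the three time-1 nodes receive the values 10, 0, and 12 (the
last by immediate exercise), so the root value is their average, 22/3.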

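A simple induction argument demonstrates that the high estimator is indeed
biased high at every node, in the sense that

$$E[\hat V_i^{j_1\cdots j_i} \mid X_i^{j_1\cdots j_i}] \ge V_i(X_i^{j_1\cdots j_i}). \qquad (8.21)$$

First observe that this holds (with equality) at every terminal node because
of (8.19) and the fact that $V_m \equiv h_m$. Now we show that if (8.21) holds
at $i+1$ it holds at $i$. From (8.20) we get

$$
\begin{aligned}
E[\hat V_i^{j_1\cdots j_i} \mid X_i^{j_1\cdots j_i}]
&= E\Big[\max\Big\{h_i(X_i^{j_1\cdots j_i}),\, \frac{1}{b}\sum_{j=1}^{b}\hat V_{i+1}^{j_1\cdots j_i j}\Big\} \,\Big|\, X_i^{j_1\cdots j_i}\Big] \\
&\ge \max\Big\{h_i(X_i^{j_1\cdots j_i}),\, E\Big[\frac{1}{b}\sum_{j=1}^{b}\hat V_{i+1}^{j_1\cdots j_i j} \,\Big|\, X_i^{j_1\cdots j_i}\Big]\Big\} \\
&= \max\big\{h_i(X_i^{j_1\cdots j_i}),\, E[\hat V_{i+1}^{j_1\cdots j_i 1} \mid X_i^{j_1\cdots j_i}]\big\} \\
&\ge \max\big\{h_i(X_i^{j_1\cdots j_i}),\, E[V_{i+1}(X_{i+1}^{j_1\cdots j_i 1}) \mid X_i^{j_1\cdots j_i}]\big\} \\
&= V_i(X_i^{j_1\cdots j_i}).
\end{aligned}
$$

The first equality uses (8.20), the next step applies Jensen's inequality, the
third step uses the fact that the $b$ successors of each node are
conditionally i.i.d., the fourth step applies the induction hypothesis at
$i+1$, and the last step follows from (8.18).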

A similar induction argument establishes that under modest moment conditions,
each $\hat V_i^{j_1\cdots j_i}$ converges in probability (and in norm) as
$b \to \infty$ to the true value $V_i(X_i^{j_1\cdots j_i})$, given
$X_i^{j_1\cdots j_i}$. This holds trivially at all terminal nodes because
$\hat V_m$ is initialized to $h_m = V_m$. The continuation value at the
$(m-1)$th step is the average of i.i.d. replications and converges by the law
of large numbers. This convergence extends to the option value (the greater
of the immediate exercise and continuation values) by the continuity of the
max operation. A key step in the induction is the "contraction" property

$$|\max(a, c_1) - \max(a, c_2)| \le |c_1 - c_2|, \qquad (8.22)$$

which, together with (8.18) and (8.20), gives

$$\left|\hat V_i^{j_1\cdots j_i} - V_i(X_i^{j_1\cdots j_i})\right| \le \left|\frac{1}{b}\sum_{j=1}^{b}\hat V_{i+1}^{j_1\cdots j_i j} - E[V_{i+1}(X_{i+1}^{j_1\cdots j_i 1}) \mid X_i^{j_1\cdots j_i}]\right|.$$

This allows us to deduce convergence at step $i$ from convergence at step
$i+1$. For details, see Theorem 1 of Broadie and Glasserman [65].

We are primarily interested in $\hat V_0$, the estimate of the option price at
the current time and state. Theorem 1 of [65] shows that if
$E[h_i^2(X_i)]$ is finite for all $i = 1,\ldots,m$, then $\hat V_0$ converges
in probability to the true value $V_0(X_0)$; moreover, it is asymptotically
unbiased in the sense that $E[\hat V_0] \to V_0(X_0)$.

These properties hold as the branching parameter $b$ increases to infinity. A
simple way to compute a confidence interval fixes a value of $b$ and
replicates the random tree $n$ times. (This is equivalent to generating $nb$
branches out of $X_0$ and $b$ branches out of all subsequent nodes.) Let
$\bar V_0(n,b)$ denote the sample mean of the $n$ replications of $\hat V_0$
generated this way, and let $s_V(n,b)$ denote their sample standard deviation.
Then with $z_{\delta/2}$ denoting the $1-\delta/2$ quantile of the standard
normal distribution,

$$\bar V_0(n,b) \pm z_{\delta/2}\,\frac{s_V(n,b)}{\sqrt{n}}$$

provides an asymptotically valid (for large $n$) $1-\delta$ confidence
interval for $E[\hat V_0]$. This is half of what we need for the interval in
(8.17). The next section provides the other half.


8.3.2 Low Estimator 


The high bias of the high estimator may be attributed to its use of the same 
information in deciding whether to exercise as in estimating the continuation 
value. This is implicit in the dynamic programming recursion (8.20): the first 
term inside the maximum is the immediate-exercise value, the second term is 
the estimated continuation value, and in choosing the maximum the estimator 
is deciding whether to exercise or continue. But the estimated continuation 
value is based on successor nodes, so the estimator is unfairly peeking into 
the future in making its decision. 

To remove this source of bias, we need to separate the exercise decision 
from the value received upon continuation. This is the key to removing high 
bias in all Monte Carlo methods for pricing American options. (See, for ex- 
ample, the explanation of Step 3 in Section 8.2.) There are several ways of 
accomplishing this in the random tree method. 

To simplify the discussion, consider the related problem of estimating

$$\max(a, E[Y])$$

from i.i.d. replications $Y_1,\ldots,Y_b$, for some constant $a$ and random
variable $Y$. This is a simplified version of the problem we face at each node
in the tree, with $a$ corresponding to the immediate exercise value and
$E[Y]$ the continuation value. The estimator $\max(a,\bar Y)$, with $\bar Y$
the sample mean of the $Y_i$, is biased high:

$$E[\max(a,\bar Y)] \ge \max(a, E[\bar Y]) = \max(a, E[Y]).$$

This corresponds to the high estimator of the previous section.

Suppose that we instead separate the $Y_i$ into two disjoint subsets and
calculate their sample means $\bar Y_1$ and $\bar Y_2$; these are independent
of each other. Now set

$$\hat v = \begin{cases} a, & \text{if } \bar Y_1 \le a; \\ \bar Y_2, & \text{otherwise.} \end{cases}$$

This estimator uses $\bar Y_1$ to decide whether to "exercise," and if it
decides not to, it uses $\bar Y_2$ to estimate the "continuation" value. Its
expectation is

$$E[\hat v] = P(\bar Y_1 \le a)\,a + (1 - P(\bar Y_1 \le a))\,E[Y] \le \max(a, E[Y]), \qquad (8.23)$$

so the estimator is indeed biased low. If $a \ne E[Y]$, then
$P(\bar Y_1 \le a) \to \mathbf{1}\{E[Y] < a\}$ and
$E[\hat v] \to \max(a, E[Y])$ as the number of replications used to calculate
$\bar Y_1$ increases. If the number of replications used to calculate
$\bar Y_2$ also increases then $\hat v \to \max(a, E[Y])$. Thus, in this
simplified setting we can easily produce a consistent estimator that is biased
low.
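
The opposite signs of the two biases can be checked by exact enumeration in a
toy case. The two-point distribution for $Y$, the value of $a$, and the
function names `high_simple` and `low_simple` below are all hypothetical
choices made for illustration.

```python
from itertools import product

def high_simple(a, ys):
    """max(a, sample mean): the same data decide and value continuation."""
    return max(a, sum(ys) / len(ys))

def low_simple(a, ys_decide, ys_value):
    """Decide with one subsample, value continuation with the other."""
    if sum(ys_decide) / len(ys_decide) <= a:
        return a
    return sum(ys_value) / len(ys_value)

# Hypothetical example: Y is -1 or 3 with equal probability, so E[Y] = 1,
# and a = 0.5, so the target max(a, E[Y]) equals 1.
a, outcomes = 0.5, (-1.0, 3.0)
pairs = list(product(outcomes, repeat=2))     # equally likely (Y_1, Y_2)
# Exact expectations obtained by averaging over the four equally likely pairs.
mean_high = sum(high_simple(a, [y1, y2]) for y1, y2 in pairs) / len(pairs)
mean_low = sum(low_simple(a, [y1], [y2]) for y1, y2 in pairs) / len(pairs)
```

Here the high estimator has mean 1.375 and the split estimator has mean 0.75,
on either side of the target value 1.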

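Broadie and Glasserman [65] use a slightly different estimator. They use all
but one of the $Y_i$ to calculate $\bar Y_1$ and use the remaining one for
$\bar Y_2$; they then average the result over all $b$ ways of leaving out one
of the $Y_i$. In more detail, the estimator is defined as follows. At all
terminal nodes, set the estimator equal to the payoff at that node:

$$\hat v_m^{j_1\cdots j_m} = h_m(X_m^{j_1\cdots j_m}).$$

At node $j_1 j_2 \cdots j_i$ at time step $i$, and for each $k = 1,\ldots,b$,
set

$$\hat v_{ik}^{j_1\cdots j_i} = \begin{cases} h_i(X_i^{j_1\cdots j_i}), & \text{if } \dfrac{1}{b-1}\displaystyle\sum_{j=1,\, j\ne k}^{b} \hat v_{i+1}^{j_1\cdots j_i j} \le h_i(X_i^{j_1\cdots j_i}); \\[1ex] \hat v_{i+1}^{j_1\cdots j_i k}, & \text{otherwise;} \end{cases} \qquad (8.24)$$

then set

$$\hat v_i^{j_1\cdots j_i} = \frac{1}{b}\sum_{k=1}^{b} \hat v_{ik}^{j_1\cdots j_i}. \qquad (8.25)$$

The estimator of the option price at the current time and state is
$\hat v_0$.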

The calculation of the low estimator is illustrated in Figure 8.4. Consider 
the third node at the first exercise date. When we leave out the first successor 
we estimate a continuation value of (4+ 0)/2 = 2 so we exercise and get 5. If 
we leave out the second successor node we continue (because 7 > 5) and get 
4. In the third case we continue and get 0. Averaging the three payoffs 5, 4, 
and 0 yields a low estimate of 3 at that node. 
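
The node-level calculation in this worked example can be sketched as a small
function. The name `low_estimator_node` is hypothetical. The successor values
14, 4, and 0 used in the check are inferred from the arithmetic quoted above
(the average of the first and third successors is 7, and the third is 0);
they are not read from the figure itself.

```python
def low_estimator_node(exercise_value, successor_lows):
    """Leave-one-out low estimator at one node, following (8.24)-(8.25).

    exercise_value : h_i at the node
    successor_lows : low-estimator values at the b >= 2 successor nodes
    """
    b = len(successor_lows)
    total = sum(successor_lows)
    values = []
    for k in range(b):
        # Continuation estimate from the b - 1 successors other than k.
        others_mean = (total - successor_lows[k]) / (b - 1)
        if others_mean <= exercise_value:
            values.append(exercise_value)     # decide to exercise
        else:
            values.append(successor_lows[k])  # continue, valued by branch k
    return sum(values) / b

# Exercise value 5 with successor values 14, 4, 0 gives payoffs 5, 4, 0.
node_low = low_estimator_node(5.0, [14.0, 4.0, 0.0])
```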

Fig. 8.4. Labels at each node show the level of the underlying asset and the value 
of the low estimator in brackets. 


An induction argument similar to the one used for the high estimator in
Section 8.3.1 and using the observation in (8.23) verifies that $\hat v_0$ is
indeed biased low. Theorem 3 of Broadie and Glasserman [65] establishes the
convergence in probability and in norm of $\hat v_0$ to the true value
$V_0(X_0)$.

From $n$ independent replications of the random tree we can calculate the
sample mean $\bar v_0(n,b)$ and sample standard deviation $s_v(n,b)$ of $n$
independent replications of the low estimator. We can then form a $1-\delta$
confidence interval for $E[\hat v_0]$,

$$\bar v_0(n,b) \pm z_{\delta/2}\,\frac{s_v(n,b)}{\sqrt{n}},$$

just as we did for the high estimator. Taking the lower limit for the low
estimator and the upper limit for the high estimator, we get the interval

$$\left(\bar v_0(n,b) - z_{\delta/2}\,\frac{s_v(n,b)}{\sqrt{n}},\;\; \bar V_0(n,b) + z_{\delta/2}\,\frac{s_V(n,b)}{\sqrt{n}}\right).$$

For this interval to fail to include the true option value $V_0(X_0)$, we must
have

$$E[\hat v_0] \le \bar v_0(n,b) - z_{\delta/2}\,\frac{s_v(n,b)}{\sqrt{n}}$$

or

$$E[\hat V_0] \ge \bar V_0(n,b) + z_{\delta/2}\,\frac{s_V(n,b)}{\sqrt{n}}.$$

Each of these events has probability $\delta/2$ (for large $n$), so their
union can have probability at most $\delta$. We thus have a conservative
confidence interval for $V_0(X_0)$. Moreover, because $E[\hat V_0]$ and
$E[\hat v_0]$ both approach $V_0(X_0)$ as $b$ increases, we can make the
confidence interval as tight as we want by increasing $n$ and $b$. This simple
technique for error control is a convenient feature of the method.
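
Given $n$ replicated values of the high and low estimators, the conservative
interval can be computed as in the following sketch, which uses only the
Python standard library. The function name `combined_interval` and the sample
inputs are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def combined_interval(high_reps, low_reps, delta=0.10):
    """Conservative 1 - delta interval from replicated high/low estimates."""
    z = NormalDist().inv_cdf(1.0 - delta / 2.0)   # z_{delta/2}
    # Lower limit from the low estimator, upper limit from the high estimator.
    lower = mean(low_reps) - z * stdev(low_reps) / sqrt(len(low_reps))
    upper = mean(high_reps) + z * stdev(high_reps) / sqrt(len(high_reps))
    return lower, upper

# Two hypothetical replications of each estimator.
lower, upper = combined_interval([10.0, 12.0], [8.0, 10.0], delta=0.10)
```

`stdev` is the sample standard deviation (divisor $n-1$), matching the usual
definition of $s_V(n,b)$ and $s_v(n,b)$.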


8.3.3 Implementation 


A naive implementation of the random tree method generates all nodes of the
tree, including the $b^m$ nodes at the final step (over $m$ steps with
branching parameter $b$), and then computes high and low estimators
recursively as described in Sections 8.3.1 and 8.3.2. By noting that the high
and low values at each node depend only on the subtree rooted at that node, we
can dramatically reduce the storage requirements of the method. It is never
necessary to store more than $mb + 1$ nodes at a time.


Depth-First Processing 


The key to this reduction lies in depth-first generation and processing of the
tree. Recall that we may label nodes in the tree through a string of indices
$j_1 j_2 \cdots j_i$, each taking values in the set $\{1,\ldots,b\}$. The
string $j_1 j_2 \cdots j_i$ labels the node reached by following the $j_1$th
branch out of the root node, then the $j_2$th branch out of the node reached
at step 1, and so on. In the depth-first algorithm, we follow a single branch
at a time rather than generating all branches simultaneously.

Consider the case of a four-step tree. We begin by generating the following
nodes:

1, 11, 111, 1111.

At this point we have reached the terminal step and can go no deeper, so we
generate nodes

1112, ..., 111b.

From these values, we can calculate high and low estimators at node 111. We
may now discard all $b$ successors of node 111. Next we generate 112 and its
successors

1121, 1122, ..., 112b.

We discard these after using them to calculate high and low estimators at
112. We repeat the process to calculate the estimators at nodes 113, ..., 11b.
These in turn can be discarded after we use them to calculate high and low
estimators at node 11. We repeat the process to compute estimators at nodes
12, ..., 1b to get estimators at node 1, and then to get estimators at nodes
2, ..., b, and finally at the root node.

Four stages of this procedure are illustrated in a tree with m = 4 and 
b = 3 in Figure 8.5. The dashed lines indicate branches previously generated, 
processed, and discarded. Detailed pseudo-code for implementing this method 
is provided in Appendix C of Broadie and Glasserman [65]. 


Fig. 8.5. Depth-first processing of tree. Solid circles indicate nodes currently in 
memory; hollow circles indicate nodes previously processed and no longer stored. 


The maximal storage requirements of this method are attained in calculating
the value at node b, at which point we are storing the values at nodes
1, ..., b-1. Just before determining the value at b, we need to know the
values at b1, ..., bb. But just before determining the value at bb we need to
store b1, ..., b(b-1) and bb1, ..., bbb. And just before processing bbb we
also need to store bbb1, ..., bbbb. We thus need to store up to $b$ nodes at
every time step plus the root node, leading to a total of $mb + 1$ nodes.
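
A recursive implementation gives depth-first processing almost for free: each
call finishes one successor's subtree before generating the next sibling, so
only the successor values at the current level of each active call are held.
The sketch below is not the pseudo-code of [65]; the function
`tree_estimators`, the geometric Brownian motion step `gbm_step`, the payoff
`call_payoff`, and all parameter values are hypothetical.

```python
import numpy as np

def tree_estimators(x, i, m, b, step, payoff, rng):
    """Return (high, low) at a node, generating the subtree depth first.

    step   : function (x, i, rng) -> one draw of X_{i+1} given X_i = x
    payoff : function (i, x) -> discounted payoff h_i(x)
    At most b successor values per level are held on the call stack.
    """
    h = payoff(i, x)
    if i == m:
        return h, h
    highs, lows = [], []
    for _ in range(b):
        # Generate one successor and finish its whole subtree before
        # generating the next sibling; the subtree is then discarded.
        hi, lo = tree_estimators(step(x, i, rng), i + 1, m, b,
                                 step, payoff, rng)
        highs.append(hi)
        lows.append(lo)
    high = max(h, sum(highs) / b)             # (8.20)
    total = sum(lows)
    low = sum(                                # (8.24)-(8.25)
        h if (total - lows[k]) / (b - 1) <= h else lows[k]
        for k in range(b)
    ) / b
    return high, low


def gbm_step(x, i, rng, r=0.05, sigma=0.2, dt=1.0):
    # One-period lognormal transition (hypothetical model parameters).
    z = rng.standard_normal()
    return x * np.exp((r - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * z)


def call_payoff(i, x, r=0.05, strike=100.0, dt=1.0):
    # Payoff of a call, discounted to time 0.
    return np.exp(-r * i * dt) * max(x - strike, 0.0)


high, low = tree_estimators(100.0, 0, 3, 4, gbm_step, call_payoff,
                            np.random.default_rng(0))
```

Both estimators are computed from the same tree, so averaging `high` and `low`
over independent replications supplies the inputs for the confidence interval
of Section 8.3.2.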


Pruning and Variance Reduction 


Broadie et al. [69] investigate potential enhancements of the random tree 
method, including the use of variance reduction techniques in combination 
with a pruning technique for reducing the computational burden of the method.
Their pruning technique is based on the observation that branching is needed
only where the optimal exercise decision is unknown. If we know it is optimal
not to exercise at a node, then it would suffice to generate a single branch
out of that node. When we work backward through the tree, the value we assign
to that node for both the high and low estimators is simply the (discounted)
value of the corresponding estimator at the unique successor node. This is
what we do implicitly in an ordinary simulation to price a European option,
where we never have the choice to exercise early. By pruning branches, we
reduce the time needed to calculate estimators from a tree.

But how can we determine that the optimal decision at a node is to continue?
Broadie et al. [69] suggest the use of bounds. Suppose, as we have in
Section 8.1, that the payoff functions $h_i$, $i = 1,\ldots,m$, are
nonnegative. Then at any node at which the payoff from immediate exercise is
0, it is optimal to continue. This simple rule is often applicable at a large
number of nodes.

European prices, when computable, can also be used as bounds. Consider an
arbitrary node $X_i = X_i^{j_1\cdots j_i}$ at the $i$th exercise date.
Consider a European option expiring at $t_m$ with discounted payoff function
$h_m$. Suppose the value of this European option at node $X_i$ is given by
$g(X_i)$. This, then, is a lower bound on the value of the American option. If
$g(X_i)$ exceeds $h_i(X_i)$, then the value of the American option must also
exceed the immediate exercise value $h_i(X_i)$ and it is therefore optimal to
continue. The same argument applies if the European option expires at one of
the intermediate dates $t_{i+1},\ldots,t_{m-1}$.
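
The two pruning tests described above amount to a simple decision about how
many successors to generate at a node. The helper below is a hypothetical
sketch (the name `branches_needed` and its interface are not from [69]); it
assumes nonnegative payoffs and takes the European value as an optional input.

```python
def branches_needed(exercise_value, b, european_value=None):
    """Number of successors to generate at a node under the pruning rules.

    european_value : value at this node of a European option expiring at a
                     later exercise date, if available (a lower bound on
                     the American value); None if no such price is known.
    """
    if exercise_value <= 0.0:
        return 1          # nonnegative payoffs: continuing is optimal
    if european_value is not None and european_value > exercise_value:
        return 1          # lower bound already exceeds exercise value
    return b              # decision unknown: full branching
```

At a node where a single branch is generated, both estimators take the value
of the corresponding estimator at the unique successor.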

When European counterparts of American options can be priced quickly
(using either a formula or a deterministic numerical method), the last step
of the tree can be completely pruned. At the $(m-1)$th exercise date, the
value of the American option is the maximum of the immediate exercise value
and the value of a European option expiring at $t_m$. If the European value is
easily computable, there is no need to simulate the tree from $t_{m-1}$ to
$t_m$. This reduces the size of the tree by a factor of $b$.

Broadie et al. [69] also discuss the use of antithetic variates and Latin 
hypercube sampling in generating branches. In calculating the low estima- 
tor with antithetics, they apply the steps in (8.24)—(8.25) to averages over 
antithetic pairs. To apply Latin hypercube sampling, they generate two inde- 
pendent Latin hypercube samples at each node, compute low estimators from 
each set, and then combine the two by applying (8.24)-(8.25) with b = 2 to 
the two low estimators. 

Figure 8.6, based on results reported in Broadie et al. [69], illustrates
the performance of the method and the enhancements. The results displayed
are for an American option on the maximum of two underlying assets. This
allows comparison with a value computed from a two-dimensional binomial
lattice. The assets follow a bivariate geometric Brownian motion with a 0.30
correlation parameter, 20% volatilities, 10% dividend yields, and initial
values (90, 90), (100, 100), or (110, 110). The risk-free rate is 5%. The
option has a strike price of 100 and can be exercised at time 0, 1, 2, or 3,
with the time between exercise opportunities equal to one year.


[Figure 8.6: vertical axis, relative error (-2% to 2%); horizontal axis, starting value (90, 100, 110).]


Fig. 8.6. Relative error in price estimates for an American option on the maximum 
of two underlying assets. Each interval shows high and low estimates and their 
midpoint. Shorter intervals use pruning and antithetic branching. 


The results displayed in Figure 8.6 apply to the initial values 90, 100, and
110 of the underlying assets, for which the option prices are 7.234, 12.412,
and 19.059. The vertical scale is measured in percent deviations from these
values. Within each of the three cases, the second confidence interval is
computed using pruning, antithetic branching, and a European option as a
control variate, whereas the first confidence interval uses only the control
variate. Within each interval, the two squares indicate the high and low
estimators and the diamond indicates their midpoint. The use of the midpoint
as point estimator is somewhat arbitrary but effective. The confidence
intervals are based on a nominal coverage of 90% ($z_{\delta/2} = 1.645$), but
because these intervals are conservative, numerical tests indicate that the
actual coverage is much higher.

The results displayed are based on branching parameter $b = 50$. The number
of replications used to estimate standard errors is set so that computing
times are the same for all cases. Because pruning can substantially decrease
the computing time required per tree, it allows a much larger number of
replications. The first interval in each case uses about one hundred
replications whereas the second interval in each case uses several thousand.
The results indicate that for problems of this size the random tree method can
readily produce estimates accurate to within about 1%, and that the proposed
enhancements can be quite effective in increasing precision.

More extensive numerical tests are reported in [65] and [69], including 
tests on problems with five underlying assets for which no “true” value is 
available. The ability to compute a confidence interval is particularly valuable 
in such settings. These results support the view that the random tree method 
is an attractive technique for problems with a high-dimensional state vector 
and a small number of exercise opportunities. Its ability to produce a reliable 
interval estimate sets it apart from heuristic approximations. 


8.4 State-Space Partitioning 


State-space partitioning (called stratification by Barraquand and Martineau 
[38], quantization by Bally and Pagès [33]) uses a finite-state dynamic pro- 
gram to approximate the value of an American option. Whereas the dynamic 
program in the random tree method is based on randomly sampled states, in 
this method the states are defined in advance based on a partitioning of the 
state space of the underlying Markov chain $X_0, X_1, \ldots, X_m$.

We continue to use the notation of Section 8.1 and the dynamic programming
formulation in (8.6)-(8.7). For each exercise date $t_i$, $i = 1,\ldots,m$,
let $A_{i1},\ldots,A_{ib_i}$ be a partition of the state space of $X_i$ into
$b_i$ subsets. For the initial time 0, take $b_0 = 1$ and
$A_{01} = \{X_0\}$. Define transition probabilities

$$p^i_{jk} = P(X_{i+1} \in A_{i+1,k} \mid X_i \in A_{ij}),$$

for all $j = 1,\ldots,b_i$, $k = 1,\ldots,b_{i+1}$, and $i = 0,\ldots,m-1$.
(Take this to be zero if $P(X_i \in A_{ij}) = 0$.) These are not transition
probabilities for $X$ in the Markovian sense (the probability that $X_{i+1}$
will fall in $A_{i+1,k}$ given $X_i \in A_{ij}$ will in general depend on past
values of the chain), but we may nevertheless use them to define an
approximating dynamic program.

For each $i = 1,\ldots,m$ and $j = 1,\ldots,b_i$, define

$$h_{ij} = E[h_i(X_i) \mid X_i \in A_{ij}],$$

taking this to be zero if $P(X_i \in A_{ij}) = 0$. Now consider the backward
induction

$$V_{ij} = \max\left\{h_{ij},\; \sum_{k=1}^{b_{i+1}} p^i_{jk}\,V_{i+1,k}\right\}, \qquad (8.26)$$

$i = 0,1,\ldots,m-1$, $j = 1,\ldots,b_i$, initialized with $V_{mj} = h_{mj}$.
This method takes the value $V_{01}$ calculated through (8.26) as an
approximation to $V_0(X_0)$, the value in the original dynamic program.

Implementation of this method requires calculation of the transition
probabilities $p^i_{jk}$ and averaged payoffs $h_{ij}$, and this is where
simulation is useful. We simulate a reasonably large number of replications of
the Markov chain $X_0, X_1, \ldots, X_m$ and estimate these quantities from
the simulation. In more detail, we record $N^i_{jk}$, the number of paths that
move from $A_{ij}$ to $A_{i+1,k}$, for all $i = 0,\ldots,m-1$,
$j = 1,\ldots,b_i$, and $k = 1,\ldots,b_{i+1}$. We then calculate the
estimates

$$\hat p^i_{jk} = N^i_{jk} \big/ \left(N^i_{j1} + \cdots + N^i_{jb_{i+1}}\right),$$

taking the ratio to be zero whenever the denominator is zero. Similarly, we
calculate $\hat h_{ij}$ as the average value of $h_i(X_i)$ over those
replications in which $X_i \in A_{ij}$.

Using these estimates of the transition probabilities and average payoffs,
we can calculate estimates $\hat V_{ij}$ of the approximating value function.
We set $\hat V_{mj} = \hat h_{mj}$ for all $j = 1,\ldots,b_m$, and then
recursively define

$$\hat V_{ij} = \max\left\{\hat h_{ij},\; \sum_{k=1}^{b_{i+1}} \hat p^i_{jk}\,\hat V_{i+1,k}\right\},$$

for $j = 1,\ldots,b_i$, $i = 0,1,\ldots,m-1$. Our estimate of the approximate
option value $V_{01}$ is then $\hat V_{01}$.
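
The estimation and backward induction just described can be sketched as
follows, assuming each simulated path has already been reduced to the subset
indices it visits and the payoffs it offers. The function `partition_value`
and its array layout are hypothetical; the time-0 payoff column is set to zero
here on the assumption that exercise at time 0 is not allowed.

```python
import numpy as np

def partition_value(cells, payoffs, sizes):
    """Estimate the partition value function by (8.26) with plug-in estimates.

    cells[p, i]   : index J_i of the subset containing X_i on path p
    payoffs[p, i] : discounted payoff h_i(X_i) on path p (column 0 is time 0;
                    use zeros there if exercise at time 0 is not allowed)
    sizes[i]      : number of subsets b_i at date i, with sizes[0] == 1
    """
    cells = np.asarray(cells)
    payoffs = np.asarray(payoffs, dtype=float)
    m = cells.shape[1] - 1

    def cell_average(i):
        # h_hat[i][j]: mean payoff over paths with X_i in A_ij (0 if none).
        count = np.bincount(cells[:, i], minlength=sizes[i])
        total = np.bincount(cells[:, i], weights=payoffs[:, i],
                            minlength=sizes[i])
        return np.divide(total, count, out=np.zeros(sizes[i]),
                         where=count > 0)

    V = [None] * (m + 1)
    V[m] = cell_average(m)
    for i in range(m - 1, -1, -1):
        # N[j, k]: number of paths moving from A_ij to A_{i+1,k}.
        N = np.zeros((sizes[i], sizes[i + 1]))
        np.add.at(N, (cells[:, i], cells[:, i + 1]), 1.0)
        row = N.sum(axis=1, keepdims=True)
        p_hat = np.divide(N, row, out=np.zeros_like(N), where=row > 0)
        V[i] = np.maximum(cell_average(i), p_hat @ V[i + 1])
    return V

# Four hypothetical paths, m = 2, two subsets at each exercise date.
cells = [[0, 0, 0], [0, 0, 1], [0, 1, 1], [0, 1, 0]]
payoffs = [[0, 1, 0], [0, 3, 8], [0, 6, 4], [0, 2, 2]]
V = partition_value(cells, payoffs, sizes=[1, 2, 2])
```

In this toy example `V[0][0]`, the estimate $\hat V_{01}$, equals 3.75.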

By the strong law of large numbers, each $\hat p^i_{jk}$ and $\hat h_{ij}$
converges with probability 1 to the corresponding $p^i_{jk}$ and $h_{ij}$ as
the number of replications increases. Moreover, the mapping from the
transition probabilities and average payoffs to the value functions is
continuous (because max, addition, and multiplication are continuous), so each
$\hat V_{ij}$ converges to $V_{ij}$ with probability 1 as well. The simulation
procedure thus gives a strongly consistent estimator of $V_{01}$.

This, however, says nothing about the relation between the approximation
$V_{01}$ and the true option value $V_0(X_0)$. For any finite number of
replications, the induction argument used to prove (8.21) in the random tree
method shows that

$$E[\hat V_{01}] \ge V_{01},$$

but the sign of the error $V_{01} - V_0(X_0)$ is unpredictable.

By adding a second simulation phase to the procedure, we can produce
an estimate that is guaranteed to be biased low, relative to the true value
$V_0(X_0)$. The idea, similar to one we used in Section 8.2, is to turn the
approximation $V_{ij}$ into an implementable stopping policy. The option value
thus produced is not just an ad hoc approximation; it is a value achievable by
the option holder by following a well-specified exercise policy.

For the $i$th state $X_i$ of the underlying Markov chain, define $J_i$ to be
the index of the subset containing $X_i$:

$$X_i \in A_{iJ_i}.$$

To each path $X_0, X_1, \ldots, X_m$, associate a stopping time

$$\tau = \min\{i : h_i(X_i) \ge V_{iJ_i}\},$$

defining $\tau$ to be $m$ if the inequality is never satisfied. This is the
stopping rule defined by the approximate value function $V_{ij}$. Because no
stopping rule can be better than an optimal stopping rule, we have

$$E[h_\tau(X_\tau)] \le V_0(X_0),$$

which says that $h_\tau(X_\tau)$ is biased low.

If the $V_{ij}$ were known, we could simulate replications of the Markov
chain, record $h_\tau(X_\tau)$ on each path, and average over paths. This
would produce a consistent estimator of $E[h_\tau(X_\tau)]$. The quantity
$E[h_\tau(X_\tau)]$ does not entail any approximations not already present in
the $V_{ij}$, and it has the advantage of being an achievable value: at each
exercise date, the holder of the option could observe $X_i$ and compare
$h_i(X_i)$ with $V_{iJ_i}$ to determine whether or not to stop.

Similar comments apply even if we rely on the estimates $\hat V_{ij}$ rather
than true values $V_{ij}$. Define $\hat\tau$ by replacing $V_{iJ_i}$ with
$\hat V_{iJ_i}$ in the definition of $\tau$. This, too, is an implementable
and suboptimal stopping rule so

$$E[h_{\hat\tau}(X_{\hat\tau}) \mid \hat V_{ij},\ j = 1,\ldots,b_i,\ i = 1,\ldots,m] \le V_0(X_0),$$

and the inequality then also holds for the unconditional expectation. The
conditional expectation on the left is the quantity to which the procedure
converges as the number of second-phase replications increases.
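
The second phase can be sketched as follows: given a table of (estimated)
partition values and a fresh set of paths, stop each path at the first date
at which the exercise payoff is at least the tabulated value, and average the
realized payoffs. The function `low_biased_value`, the array layout, and the
sample numbers are hypothetical; the sketch assumes exercise is not considered
at time 0.

```python
import numpy as np

def low_biased_value(V, cells, payoffs):
    """Average payoff under the stopping rule implied by a value table V.

    V[i][j]       : (estimated) partition value at date i, subset j
    cells[p, i]   : index J_i of the subset containing X_i on path p
    payoffs[p, i] : discounted payoff h_i(X_i) on path p
    Paths here should be independent of those used to build V.
    """
    cells = np.asarray(cells)
    payoffs = np.asarray(payoffs, dtype=float)
    n, m = cells.shape[0], cells.shape[1] - 1
    realized = payoffs[:, m].copy()           # tau = m if never stopped
    stopped = np.zeros(n, dtype=bool)
    for i in range(1, m):
        # Stop at the first date with h_i(X_i) >= V_{i J_i}.
        threshold = np.asarray(V[i])[cells[:, i]]
        stop_now = ~stopped & (payoffs[:, i] >= threshold)
        realized[stop_now] = payoffs[stop_now, i]
        stopped |= stop_now
    return realized.mean()

# Hypothetical value table and two second-phase paths (m = 2).
V = [np.array([3.75]), np.array([3.5, 4.0]), np.array([1.0, 6.0])]
estimate = low_biased_value(V, [[0, 0, 1], [0, 1, 0]],
                            [[0, 4, 1], [0, 2, 5]])
```

The first path stops at date 1 (payoff 4 against a tabulated value of 3.5) and
the second runs to date 2 (payoff 5), giving an average of 4.5.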

The main challenge in using any variant of this approach lies in the se- 
lection of the state-space partitions. Bally and Pagès [33] discuss criteria for 
“optimal” partitions and propose simulation-based procedures for their con- 
struction. These appear to be computationally demanding. The effort might 
be justified if a partition, once constructed, could be applied to price many 
different American options. This, however, would not lend itself to tailoring 
the partition to the form of the payoff. 

It is natural to expect that as the resolution of the partitions increases, the 
option value produced by this approach converges to the true value; see The- 
orem 1 of Bally and Pagès [33] for a precise statement. But methods that rely 
on refining a priori partitions are not well-suited to high-dimensional state 
spaces, just as deterministic numerical integration procedures are not well- 
suited to high-dimensional problems. In our discussion of stratified sampling 
we noted (cf. Example 4.3.4) that the number of strata required to maintain 
the same resolution along all dimensions grows exponentially with the num- 
ber of dimensions. In the absence of problem-specific information, state-space 
partitioning exhibits similar dependence on dimension. 


8.5 Stochastic Mesh Methods 


8.5.1 General Framework 


Like the random tree algorithm of Section 8.3, the stochastic mesh method 
solves a randomly sampled dynamic programming problem to approximate 
the price of an American option. The key distinction is that in valuing the 
option at a node at time step $i$, the mesh uses values from all nodes at time 
step $i+1$, not just those that are successors of the current node. This in fact 
is why it produces a mesh rather than a tree. It keeps the number of nodes at 
each time step fixed, avoiding the exponential growth characteristic of a tree. 

A general construction is illustrated in Figure 8.7. In the first phase, we 
simulate independent paths of the Markov chain $X_0, X_1,\dots,X_m$; in the second 
phase, we “forget” which node at time $i$ generated which node at $i+1$ 
and interconnect all nodes at consecutive time steps for the backward induction. 
We will consider other mechanisms for generating the nodes at each time 
step, but this is the most important one. We refer to it as the independent-path 
construction. 


Fig. 8.7. Construction of stochastic mesh from independent paths. Nodes are 
generated by independent paths (top); weight $W^i_{jk}$ is then assigned to a 
transition from the $j$th node at step $i$ to the $k$th node at step $i+1$ (bottom). 


For the pricing problem, we use the dynamic programming formulation 
and notation of (8.6)-(8.7). In the mesh, we use $X_{ij}$ to denote the $j$th node at 
the $i$th exercise date, for $i = 1,\dots,m$ and $j = 1,\dots,b$. We use $\hat V_{ij}$ to denote 
the estimated value at this node, computed as follows. At the terminal nodes 
we set $\hat V_{mj} = h_m(X_{mj})$; we then work backward recursively by defining 

$$\hat V_{ij} = \max\left\{ h_i(X_{ij}),\ \frac{1}{b}\sum_{k=1}^{b} W^i_{jk}\,\hat V_{i+1,k} \right\} \tag{8.27}$$

for some set of weights $W^i_{jk}$. At the root node, we set 

$$\hat V_0 = \frac{1}{b}\sum_{k=1}^{b} \hat V_{1k} \tag{8.28}$$

or the maximum of this and $h_0(X_0)$ if we want to allow exercise at time 0. 

A fundamental distinction between (8.27) and the superficially similar re- 
cursion in (8.26) is that (8.27) evolves over randomly sampled nodes, whereas 
(8.26) evolves over fixed subsets of the state space. 
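
The recursion (8.27)-(8.28) translates almost line by line into code. The following is a minimal Python sketch rather than the authors' implementation: the function name `mesh_high_estimator`, the callables `weight` and `payoff`, and the list-of-lists layout of the nodes are all assumptions made for illustration, and the weights are treated as given.

```python
def mesh_high_estimator(nodes, weight, payoff, x0=None):
    """High-biased mesh estimate via the recursion (8.27)-(8.28).

    nodes[i - 1][j] holds node X_{ij} for i = 1..m and j = 0..b-1.
    weight(i, j, k) returns W^i_{jk}; payoff(i, x) returns h_i(x).
    All three are hypothetical interfaces chosen for this sketch.
    """
    m = len(nodes)
    b = len(nodes[0])
    # Terminal condition: V_{mj} = h_m(X_{mj}).
    values = [payoff(m, x) for x in nodes[m - 1]]
    # Work backward over exercise dates i = m-1, ..., 1.
    for i in range(m - 1, 0, -1):
        updated = []
        for j in range(b):
            # Weighted average over ALL nodes at step i+1 (an O(b) operation).
            continuation = sum(weight(i, j, k) * values[k] for k in range(b)) / b
            updated.append(max(payoff(i, nodes[i - 1][j]), continuation))
        values = updated
    # Root node (8.28): plain average of the step-1 values.
    v0 = sum(values) / b
    if x0 is not None:  # optionally allow exercise at time 0
        v0 = max(v0, payoff(0, x0))
    return v0


# Toy usage: unit weights and payoff h_i(x) = x on a two-step, two-node mesh.
toy_nodes = [[1.0, 2.0], [3.0, 5.0]]
toy_value = mesh_high_estimator(toy_nodes, lambda i, j, k: 1.0, lambda i, x: x)
print(toy_value)
```

The double loop over `j` and `k` makes the quadratic cost per time step visible; any specific choice of weights only changes what `weight` returns.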

As we have already seen several algorithms with a structure similar to 
(8.27), the main issue we need to address is the selection of the weights $W^i_{jk}$. 
A closely related issue is the sampling of the nodes $X_{ij}$. The independent-path 
construction in Figure 8.7 provides one mechanism but by no means the only 
one. We could, for example, generate $b$ nodes at each time step $i$ by drawing 
independent samples from the marginal distribution of $X_i$. Different rules for 
sampling the nodes and selecting the weights correspond to different versions 
of the method. 

The formulation of the stochastic mesh method we give here is a bit more 
general than the one originally introduced by Broadie and Glasserman [66]. 
The advantage of a more general formulation is that it allows a unified treat- 
ment of related techniques, clarifying which features distinguish the methods 
and which features are shared. We postpone discussion of the detailed con- 
struction of weights — the central issue — to Sections 8.5.2 and 8.6.2 and 
proceed with the general formulation. 


Conditions on the Mesh 


We first impose two conditions on the mesh construction and weights that 
are unlikely to exclude any cases of practical interest, then add a third, more 
restrictive condition. To state these, let 


$$\mathbf{X}_i = (X_{i1},\dots,X_{ib})$$

denote the “mesh state” at step $i$ consisting of all nodes at step $i$, for 
$i = 1,\dots,m$, and let $\mathbf{X}_0 = X_0$. We assume that the mesh construction is 
Markovian in the following sense: 

(M1) $\{\mathbf{X}_0,\dots,\mathbf{X}_{i-1}\}$ and $\{\mathbf{X}_{i+1},\dots,\mathbf{X}_m\}$ are independent given $\mathbf{X}_i$, for all 
$i = 1,\dots,m-1$. 

This condition is satisfied by the independent-path construction, because 
in that case $\{X_{1j},\dots,X_{mj}\}$, $j = 1,\dots,b$, are independent copies of the 
Markov chain. It is also satisfied if nodes at different exercise dates are 
generated independently of each other. We also require 

(M2) Each weight $W^i_{jk}$ is a deterministic function of $\mathbf{X}_i$ and $\mathbf{X}_{i+1}$. 

This includes as a special case the possibility that $W^i_{jk}$ is a function only 
of $X_{ij}$ and $X_{i+1,k}$. 

Recall from (8.11) that $C_i(x)$ denotes the continuation value in state $x$ 
at time $i$. Our next condition restricts the choice of weights to those that 
correctly estimate continuation values, on average: 

(M3) For all $i = 1,\dots,m-1$ and all $j = 1,\dots,b$, 

$$\frac{1}{b}\sum_{k=1}^{b} E\big[W^i_{jk}\,V_{i+1}(X_{i+1,k}) \mid \mathbf{X}_i\big] = C_i(X_{ij}).$$

This says that if we knew the true option values at time $i+1$, the expected 
weighted average calculated at a node at time $i$ would be the true continuation 
value. 


High Bias 


As a first implication of (M1)-(M3), we show that these conditions imply that 
the mesh estimator $\hat V_0$ defined by (8.27)-(8.28) is biased high. The argument 
is similar to the one we used for the random tree algorithm in Section 8.3.1 
but, because of the mesh weights, sufficiently different to merit presentation. 
We show that if, for some $i$, 

$$E[\hat V_{i+1,j} \mid \mathbf{X}_{i+1}] \ge V_{i+1}(X_{i+1,j}), \quad j = 1,\dots,b, \tag{8.29}$$

then the same holds for all smaller $i$. Once this is established, noting that 
(8.29) holds (with equality) at $i = m-1$ completes the induction argument. 
Using (8.27) and Jensen’s inequality, we get 

$$E[\hat V_{ij} \mid \mathbf{X}_i] \ge \max\left\{ h_i(X_{ij}),\ \frac{1}{b}\sum_{k=1}^{b} E[W^i_{jk}\,\hat V_{i+1,k} \mid \mathbf{X}_i] \right\}. \tag{8.30}$$

We examine the conditional expectations on the right. By further conditioning 
on $\mathbf{X}_{i+1}$ and using (M2), we get 

$$\begin{aligned}
E[W^i_{jk}\,\hat V_{i+1,k} \mid \mathbf{X}_i, \mathbf{X}_{i+1}] &= W^i_{jk}\,E[\hat V_{i+1,k} \mid \mathbf{X}_i, \mathbf{X}_{i+1}] \\
&= W^i_{jk}\,E[\hat V_{i+1,k} \mid \mathbf{X}_{i+1}];
\end{aligned} \tag{8.31}$$

the second equality following from (M1) and the definition of $\hat V_{i+1,k}$ through 
the backward induction in (8.27). Applying the induction hypothesis (8.29) 
to (8.31), we get 

$$E[W^i_{jk}\,\hat V_{i+1,k} \mid \mathbf{X}_i, \mathbf{X}_{i+1}] \ge W^i_{jk}\,V_{i+1}(X_{i+1,k}),$$

from which follows 

$$\frac{1}{b}\sum_{k=1}^{b} E[W^i_{jk}\,\hat V_{i+1,k} \mid \mathbf{X}_i] \ge \frac{1}{b}\sum_{k=1}^{b} E[W^i_{jk}\,V_{i+1}(X_{i+1,k}) \mid \mathbf{X}_i] = C_i(X_{ij}),$$

using (M3). Applying this inequality to (8.30), we conclude that 

$$E[\hat V_{ij} \mid \mathbf{X}_i] \ge \max\{h_i(X_{ij}),\ C_i(X_{ij})\} = V_i(X_{ij}),$$

which is what we needed to show. 

This induction argument offers some insight into the types of conditions 
needed to ensure convergence of the mesh values $\hat V_{ij}$ to the true values $V_i(X_{ij})$, 
given $X_{ij}$. Applying the contraction property (8.22) and (M3) to the mesh 
recursion (8.27) yields the bound 

$$\begin{aligned}
|\hat V_{ij} - V_i(X_{ij})| &\le \left| \frac{1}{b}\sum_{k=1}^{b} \Big( W^i_{jk}\,\hat V_{i+1,k} - E[W^i_{jk}\,V_{i+1}(X_{i+1,k}) \mid \mathbf{X}_i] \Big) \right| \\
&\le \left| \frac{1}{b}\sum_{k=1}^{b} \Big( W^i_{jk}\,V_{i+1}(X_{i+1,k}) - E[W^i_{jk}\,V_{i+1}(X_{i+1,k}) \mid \mathbf{X}_i] \Big) \right| \\
&\quad + \left| \frac{1}{b}\sum_{k=1}^{b} W^i_{jk}\,\big[\hat V_{i+1,k} - V_{i+1}(X_{i+1,k})\big] \right|.
\end{aligned}$$

To require that the first term on the right side of the last inequality vanish is 
to require that the summands satisfy a law of large numbers. The second term 
requires a sufficiently strong induction hypothesis on the convergence of the 
mesh estimate at time $i+1$. Because $\hat V_{mj} = V_m(X_{mj})$ at all terminal nodes 
for all $b$, it is natural to proceed backward by induction. 

Broadie and Glasserman [66] use these observations to prove convergence 
of the mesh estimator when $\mathbf{X}_1,\dots,\mathbf{X}_m$ are independent of each other, 
$X_{i1}, X_{i2},\dots$ are i.i.d. for each $i$, and each weight $W^i_{jk}$ is a function only of $X_{ij}$ 
and $X_{i+1,k}$. The independence assumptions facilitate the application of laws 
of large numbers. In the more general cases encompassed by conditions (M1)-(M2), 
the problem is complicated by the dependence between nodes and the 
generality of the weights. Avramidis and Matzinger [29] derive a probabilistic 
upper bound on the error in the mesh estimator for a type of dependence 
structure that fits within conditions (M1)-(M2). They use this bound to prove 
convergence as $b \to \infty$. 

Rust [814] proves convergence of a related method for a general class of 
dynamic programming problems, not specifically focused on American options 
or optimal stopping. In his method, the $\{X_{ij}, j = 1,\dots,b\}$ are independent 
and uniformly distributed over a compact set, and the same nodes are used 
for all $i$. 


Low Estimator 


Broadie and Glasserman [66] supplement the high-biased mesh estimator with 
a low-biased estimator. Their low estimator uses a stopping rule defined by 
the mesh. 

To define the stopping rule, we need to extend the weights $W^i_{jk}$ from 
$X_{i1},\dots,X_{ib}$ to all points in the state space at time $i$. The details of how we 
do this will be evident once we consider specific choices of weights; for now, 
let us simply suppose that we have a function $W^i_k(\cdot)$ on the state space at time 
$i$. We interpret $W^i_k(x)$ as the weight from state $x$ at time $i$ to node $X_{i+1,k}$. 

Through this weight function, the mesh defines a continuation value 
throughout the state space, and not just at the nodes in the mesh. The 
continuation value at state $x$ at time $i$, $i = 1,\dots,m-1$, is given by 

$$\hat C_i(x) = \frac{1}{b}\sum_{k=1}^{b} W^i_k(x)\,\hat V_{i+1,k}. \tag{8.32}$$

If we impose the reasonable requirement that $W^i_k(X_{ij}) = W^i_{jk}$ (so that the 
weight function does in fact extend the original weights), then $\hat C_i(X_{ij})$ coincides 
with the continuation value estimated by the mesh at node $X_{ij}$, and $\hat C_i$ 
interpolates from the original nodes to the rest of the state space. Set $\hat C_m \equiv 0$. 

With the mesh held fixed, we now simulate a path $X_0, X_1,\dots,X_m$ of the 
underlying Markov chain, independent of the paths used to construct the 
mesh. Define a stopping time 

$$\hat\tau = \min\{i : h_i(X_i) \ge \hat C_i(X_i)\}; \tag{8.33}$$

this is the first time the immediate exercise value is as great as the estimated 
continuation value. The low estimator for a single path is 

$$\hat v = h_{\hat\tau}(X_{\hat\tau}), \tag{8.34}$$

the payoff from stopping at $\hat\tau$. 

That this is indeed biased low follows from the observation that no policy 
can be better than an optimal policy. Conditioning on the mesh fixes the 
stopping rule and yields 

$$E[\hat v \mid \mathbf{X}_1,\dots,\mathbf{X}_m] \le V_0(X_0).$$

The same then applies to the unconditional expectation. 

By simulating multiple paths independently, with each following the mesh- 
defined stopping rule, we can calculate an average low estimator conditional on 
the mesh. We can then generate independent copies of the mesh and calculate 
high and low estimators from each copy. From these independent replications 
of the high and low estimators we can calculate sample means and standard 
deviations to form a confidence interval for each estimator. The two intervals 
can then be combined as in Figure 8.2. Assuming independently generated 
nodes in the mesh, Broadie and Glasserman [66] give conditions under which 
the low estimator is asymptotically unbiased, meaning that $E[\hat v] \to V_0(X_0)$ 
as $b \to \infty$ in the mesh. When this holds, the combined confidence interval 
shrinks to the point $V_0(X_0)$ if we let $b \to \infty$ and then $n \to \infty$. 
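
The stopping rule (8.33) and the low estimator (8.34) can be sketched in a few lines of Python. This is an illustrative sketch only: the function names and the callable interfaces `weight_to_node`, `continuation`, and `payoff` are hypothetical, the path is indexed from step 1, and payoffs are assumed nonnegative so that forcing exercise at step $m$ agrees with setting the continuation value to 0 there.

```python
def mesh_continuation(x, next_values, weight_to_node):
    """Estimated continuation value (8.32) at an arbitrary state x.

    next_values[k] is the mesh value at node k of the next step, and
    weight_to_node(x, k) is the extended weight W^i_k(x).
    """
    b = len(next_values)
    return sum(weight_to_node(x, k) * next_values[k] for k in range(b)) / b


def low_estimator_payoff(path, continuation, payoff):
    """Payoff (8.34) from following the mesh stopping rule (8.33) on one path.

    path[i - 1] is X_i for i = 1..m; continuation(i, x) is the mesh-defined
    continuation value at step i < m. Exercise is forced at step m.
    """
    m = len(path)
    for i in range(1, m):
        x = path[i - 1]
        if payoff(i, x) >= continuation(i, x):
            return payoff(i, x)  # first step where exercise beats continuation
    return payoff(m, path[m - 1])


def low_estimator(paths, continuation, payoff):
    """Average payoff over independent paths, conditional on a fixed mesh."""
    total = sum(low_estimator_payoff(p, continuation, payoff) for p in paths)
    return total / len(paths)
```

In use, `continuation(i, x)` would wrap `mesh_continuation` with the stored mesh values at step $i+1$; the paths passed to `low_estimator` must be simulated independently of the mesh for the low-bias argument to apply.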

An Interleaving Estimator 


At several points in this chapter we have noted two sources of bias affecting 
simulation estimates of American option prices: high bias resulting (through 
Jensen’s inequality) from applying backward induction over a finite set of 
paths, and low bias resulting from suboptimal exercise. The development of 
high and low estimators in this section keeps the two sources of bias separate, 
first applying backward induction and then applying a mesh-defined stopping 
rule. But blending the two techniques has the potential to produce a more 
accurate value by not compounding each source of bias. We present one such 
combination from the method of Longstaff and Schwartz [241], though they 
present it in a different setting. 


Suppose value estimates $\hat V_{i+1,j},\dots,\hat V_{mj}$, $j = 1,\dots,b$, have already been 
determined for steps $i+1,\dots,m$, starting with the initialization $\hat V_{mj} = h_m(X_{mj})$. 
These determine estimated continuation values: 

$$\hat C_{\ell j} = \frac{1}{b}\sum_{k=1}^{b} W^{\ell}_{jk}\,\hat V_{\ell+1,k}, \quad \ell = i,\dots,m-1.$$

Extending the weights throughout the state space 
as before then defines continuation values $\hat C_\ell(x)$ for every state $x$ at time $\ell$. 

To assign a value to a node $X_{ij}$ at step $i$, we consider two cases. If the 
immediate exercise value $h_i(X_{ij})$ is at least as great as the estimated 
continuation value $\hat C_{ij}$ then we exercise at node $X_{ij}$ and thus set 

$$\hat V_{ij} = h_i(X_{ij});$$

this coincides with the high estimator (8.27). If, however, $h_i(X_{ij}) < \hat C_{ij}$, then 
rather than assign the high value $\hat C_{ij}$ to the current node (as (8.27) does), 
we simulate a path of the Markov chain $\tilde X_i, \tilde X_{i+1},\dots,\tilde X_m$ starting from the 
current node $\tilde X_i = X_{ij}$. To this path we apply the stopping rule (8.33) defined 
by the continuation values at $i+1,\dots,m-1$. We record the payoff received 
at exercise and assign this as the value at $\hat V_{ij}$, just as we would using the low 
estimator starting from node $X_{ij}$. (In the method of Longstaff and Schwartz 
[241], the path $\tilde X_i, \tilde X_{i+1},\dots,\tilde X_m$ is just the original path $X_{ij}, X_{i+1,j},\dots,X_{mj}$ 
passing through $X_{ij}$, but one could alternatively simulate an independent path 
from $X_{ij}$.) 

To be more precise, we need to define a new stopping time for each $j$ and 
apply it to a path starting from $X_{ij}$. Thus, for the path $\tilde X_i, \tilde X_{i+1},\dots,\tilde X_m$ 
starting at $\tilde X_i = X_{ij}$, define 

$$\hat\tau_{ij} = \min\{\ell \in \{i, i+1,\dots,m\} : h_\ell(\tilde X_\ell) \ge \hat C_\ell(\tilde X_\ell)\}.$$

Then the two cases for assigning a value to $\hat V_{ij}$ can be combined into the rule 

$$\hat V_{ij} = h_{\hat\tau_{ij}}(\tilde X_{\hat\tau_{ij}}). \tag{8.35}$$

This procedure interleaves elements of the high and low estimators, alternating 
between the two by applying backward induction to estimate a continuation 
value and then applying a suboptimal stopping rule starting from each 
node. This may partly offset the two sources of bias, but a precise comparison 
remains open for investigation. If every path $\tilde X_i, \tilde X_{i+1},\dots,\tilde X_m$ used in (8.35) 
is independent of the original set of paths (given $\tilde X_i = X_{ij}$), then because 
the $\hat V_{ij}$ in (8.35) result from stopping rules, these value estimates are biased 
low. In practice, using the original mesh path $X_{ij},\dots,X_{mj}$ rather than an 
independent path will usually also result in a low bias, because in this method 
the high bias resulting from Jensen’s inequality tends to be less pronounced 
than the low bias resulting from a suboptimal stopping rule. 
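
The interleaving rule (8.35) can be sketched as follows for the variant that reuses the original mesh paths, in the spirit of Longstaff and Schwartz. This is a hedged illustration, not a reference implementation: the names `interleaved_values`, `weight_at`, and `payoff` are hypothetical, column $j$ of `nodes` is assumed to be one simulated path (the independent-path construction), and continuation values are recomputed naively rather than cached.

```python
def interleaved_values(nodes, weight_at, payoff):
    """Interleaving estimator (8.35) along the original mesh paths.

    nodes[i - 1][j] is X_{ij}, with column j forming one simulated path.
    weight_at(l, x, k) is the extended weight W^l_k(x); payoff(l, x) is h_l(x).
    Returns the list of step-1 values.
    """
    m = len(nodes)
    b = len(nodes[0])
    values = {m: [payoff(m, x) for x in nodes[m - 1]]}

    def continuation(l, x):
        # Mesh continuation value at step l < m, using values already set at l+1.
        return sum(weight_at(l, x, k) * values[l + 1][k] for k in range(b)) / b

    for i in range(m - 1, 0, -1):
        row = []
        for j in range(b):
            # Follow path j forward from step i and stop at the first step l
            # where exercise is at least as good as estimated continuation
            # (exercise is forced at l = m).
            for l in range(i, m + 1):
                x = nodes[l - 1][j]
                if l == m or payoff(l, x) >= continuation(l, x):
                    row.append(payoff(l, x))
                    break
        values[i] = row
    return values[1]


# Toy usage: unit weights, payoff h_l(x) = x, two paths over two steps.
toy = interleaved_values([[1.0, 2.0], [3.0, 5.0]], lambda l, x, k: 1.0, lambda l, x: x)
print(toy)
```

On this toy mesh the high estimator would assign the continuation estimate 4 to both step-1 nodes, whereas the interleaved rule assigns each node the payoff actually realized along its own path.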


8.5.2 Likelihood Ratio Weights 


In this section and in Section 8.6.2, we take up the question of defining weights 
$W^i_{jk}$ that satisfy conditions (M2) and (M3) of Section 8.5.1. The alternatives 
we consider differ in their scope and computational requirements. 

This section discusses weights defined through likelihood ratios, as pro- 
posed by Broadie and Glasserman [66]. Some of the ideas from our discussion 
of importance sampling in Section 4.6.1 are relevant here as well. But whereas 
the motivation for changing probability measure in importance sampling is 
variance reduction, here we use it to correct pricing as we move backward 
through the mesh. 

Suppose that the state space of the Markov chain $X_0, X_1,\dots,X_m$ is $\mathbb{R}^d$ 
and that the chain admits transition densities $f_1,\dots,f_m$, meaning that for 
$x \in \mathbb{R}^d$ and $A \subseteq \mathbb{R}^d$, 

$$P(X_i \in A \mid X_{i-1} = x) = \int_A f_i(x,y)\,dy, \quad i = 1,\dots,m.$$

With $X_0$ fixed, $g_1(\cdot) = f_1(X_0,\cdot)$ is the marginal density of $X_1$, and then by 
induction, 

$$g_i(y) = \int g_{i-1}(x)\,f_i(x,y)\,dx$$

gives the marginal density of $X_i$, $i = 2,\dots,m$. The optimal continuation value 
in state $x$ at time $i$ is 

$$C_i(x) = E[V_{i+1}(X_{i+1}) \mid X_i = x] = \int V_{i+1}(y)\,f_{i+1}(x,y)\,dy,$$

an integral with respect to a transition density. The main purpose of weights 
in the mesh is to estimate these continuation values. 

Fix a node $X_{ij}$ in the mesh and consider the estimation of the continuation 
value at that node. To motivate the introduction of likelihood ratios, 
we begin with some simplifying assumptions. Suppose that the nodes $X_{i+1,k}$, 
$k = 1,\dots,b$, at the next step in the mesh were generated independently of 
each other and of all other nodes in the mesh from some density $g$. Suppose 
also that we know the true option values $V_{i+1}(X_{i+1,k})$ at these downstream 
nodes. Averaging over the $b$ nodes at step $i+1$ and letting $b \to \infty$ yields 

$$\frac{1}{b}\sum_{k=1}^{b} V_{i+1}(X_{i+1,k}) \to \int V_{i+1}(y)\,g(y)\,dy,$$

which will not in general equal the desired continuation value $C_i(X_{ij})$. In 
particular, if $g$ is the marginal density $g_{i+1}$ of the Markov chain at time $i+1$, 
then the limit 

$$\frac{1}{b}\sum_{k=1}^{b} V_{i+1}(X_{i+1,k}) \to E[V_{i+1}(X_{i+1})]$$

is the unconditional expected value at $i+1$, whereas what we want is a 
conditional expectation. 

The purpose of the mesh weights is thus to correct for the fact that the 
downstream nodes were sampled from $g$ rather than from the transition density 
$f_{i+1}(X_{ij},\cdot)$. (In contrast, in the random tree method of Section 8.3 all 
successor nodes of a given node are generated from the same transition law 
and thus get equal weight.) Suppose we set each weight to be 

$$W^i_{jk} = \frac{f_{i+1}(X_{ij}, X_{i+1,k})}{g(X_{i+1,k})}, \tag{8.36}$$

the likelihood ratio relating the transition density to the mesh density $g$. The 
pairs $(W^i_{jk}, V_{i+1}(X_{i+1,k}))$, $k = 1,\dots,b$, are i.i.d. given $X_{ij}$, so we now get 

$$\begin{aligned}
\frac{1}{b}\sum_{k=1}^{b} W^i_{jk}\,V_{i+1}(X_{i+1,k}) &\to E_g[W^i_{jk}\,V_{i+1}(X_{i+1,k}) \mid X_{ij}] \\
&= \int \frac{f_{i+1}(X_{ij},y)}{g(y)}\,V_{i+1}(y)\,g(y)\,dy \\
&= \int f_{i+1}(X_{ij},y)\,V_{i+1}(y)\,dy = C_i(X_{ij}),
\end{aligned}$$

which is what we wanted. (We have subscripted the expectation by $g$ to emphasize 
that $X_{i+1,k}$ has density $g$.) This is in fact stronger than the requirement 
in (M3), and the weights in (8.36) clearly satisfy (M2). 
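
A small numerical experiment illustrates the correction performed by the weights (8.36). The Python sketch below uses a toy one-dimensional model that is not from the text: the transition density is assumed to be $N(x, 1)$, the mesh density $g$ is assumed to be $N(0, 2^2)$, and $y^2$ stands in for the value function $V_{i+1}$, so that $\int V g = 4$ and $C(x) = x^2 + 1$ are known in closed form.

```python
import math
import random


def normal_pdf(y, mean, sd):
    """Density of N(mean, sd^2) at y."""
    z = (y - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def continuation_estimates(x, b, seed=0):
    """Compare unweighted and likelihood-ratio-weighted averages at state x."""
    rng = random.Random(seed)
    value = lambda y: y * y                      # stand-in for V_{i+1}
    f = lambda x_, y: normal_pdf(y, x_, 1.0)     # assumed transition density
    g = lambda y: normal_pdf(y, 0.0, 2.0)        # assumed mesh density
    nodes = [rng.gauss(0.0, 2.0) for _ in range(b)]  # downstream nodes drawn from g
    plain = sum(value(y) for y in nodes) / b
    # Each term is weighted by the likelihood ratio f/g, as in (8.36).
    weighted = sum(f(x, y) / g(y) * value(y) for y in nodes) / b
    return plain, weighted


plain, weighted = continuation_estimates(x=0.5, b=20000)
# Targets in this toy model: plain should be near 4, weighted near 0.5**2 + 1 = 1.25.
print(plain, weighted)
```

The unweighted average estimates an unconditional expectation under $g$; only the weighted average targets the conditional expectation given the current node.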

Likelihood ratios provide the only completely general weights, in the following 
sense. If $W^i_{jk}$ is a function only of $X_{ij}$ and $X_{i+1,k}$, and if 

$$E_g[W^i_{jk}\,h(X_{i+1,k}) \mid X_{ij}] = \int f_{i+1}(X_{ij},y)\,h(y)\,dy \tag{8.37}$$

for all bounded $h : \mathbb{R}^d \to \mathbb{R}$, then the Radon-Nikodym Theorem (cf. Appendix 
B.4) implies that $W^i_{jk}$ equals the likelihood ratio in (8.36) with probability 
1. Interpret (8.37) as stating that the weights give the correct “price” at node 
$X_{ij}$ for payoff $h(\cdot)$ at time $i+1$. Uniqueness holds with other sufficiently rich 
classes of functions $h$. This indicates that alternative strategies for selecting 
weights must in part rely on restricting the class of admissible functions $h$, 
though the restriction should not exclude the value function $V_{i+1}$. 


Weights in a Markovian Mesh 


We now drop the assumption that nodes in the mesh are generated indepen- 
dently and consider more general constructions consistent with the Markovian 
condition (M1). Even if we fix the mechanism used to generate the mesh, there 
is some flexibility in the choice of likelihood ratio weights that results from 
the flexibility in (M2) to allow the weights $W^i_{jk}$ to depend on all nodes at 
times $i$ and $i+1$. The alternatives we present below satisfy (M1)-(M3). 

Consider the independent-path construction of Figure 8.7 based on independent 
paths $(X_{1j},\dots,X_{mj})$, $j = 1,\dots,b$. For $k \ne j$, $X_{i+1,k}$ is independent 
of $X_{ij}$; its conditional distribution given $X_{ij}$ is therefore just its unconditional 
marginal distribution, which has density $g_{i+1}$. For $k = j$, the conditional 
distribution is given by the transition density out of $X_{ij}$, so no weight is needed. 
We thus arrive at the weights 

$$W^i_{jk} = \begin{cases} \dfrac{f_{i+1}(X_{ij}, X_{i+1,k})}{g_{i+1}(X_{i+1,k})}, & k \ne j; \\[2ex] 1, & k = j. \end{cases} \tag{8.38}$$

Alternatively, we could use the fact that the pairs $(X_{i\ell}, X_{i+1,\ell})$, $\ell = 1,\dots,b$, 
are i.i.d. (in the independent-path construction) with joint density 
$g_i(x) f_{i+1}(x,y)$. The relevant likelihood ratios now take the form $W^i_{jj} = 1$ and 

$$W^i_{jk} = \frac{f_{i+1}(X_{ij}, X_{i+1,k})}{f_{i+1}(X_{ik}, X_{i+1,k})}, \quad k \ne j. \tag{8.39}$$

In contrast to (8.38), these weights use information about the node $X_{ik}$ from 
which $X_{i+1,k}$ was generated. A simple calculation shows that each weight in 
(8.38) is the conditional expectation given $X_{i+1,k}$ of the corresponding weight 
in (8.39). 

Next consider yet another construction of the mesh in which the nodes at 
$i+1$ are generated from the nodes at $i$ as follows. We pick a node $X_{i\ell}$ randomly 
and uniformly from the $b$ nodes at time $i$ and generate a successor by sampling 
from the transition density $f_{i+1}(X_{i\ell},\cdot)$ out of $X_{i\ell}$. We repeat the procedure 
to generate a total of $b$ nodes at time $i+1$, each time drawing randomly and 
uniformly from all $b$ nodes at time $i$. (In other words, we sample $b$ times “with 
replacement” from the nodes at time $i$ and generate a successor from each 
selected node.) This construction is clearly consistent with the Markovian 
condition (M1). Given $\mathbf{X}_i$, the nodes at time $i+1$ are i.i.d. with density 

$$\frac{1}{b}\sum_{\ell=1}^{b} f_{i+1}(X_{i\ell},\cdot), \tag{8.40}$$

the average of the transition densities out of the nodes at time $i$. The 
corresponding likelihood ratios are 

$$W^i_{jk} = \frac{f_{i+1}(X_{ij}, X_{i+1,k})}{\frac{1}{b}\sum_{\ell=1}^{b} f_{i+1}(X_{i\ell}, X_{i+1,k})}. \tag{8.41}$$

As $b \to \infty$, the average density (8.40) converges to the marginal density 
$g_{i+1}$, so the weights in (8.41) are close to the ratios 
$f_{i+1}(X_{ij}, X_{i+1,k})/g_{i+1}(X_{i+1,k})$ appearing in (8.38). But the weights in 
(8.41) have a property that makes them appealing. If we fix $k$ and sum over 
$j$, we get 

$$\frac{1}{b}\sum_{j=1}^{b} W^i_{jk} = 1; \tag{8.42}$$

the average weight into a node is 1. This property may be surprising. If we 
were to interpret the ratios $W^i_{jk}/b$ as transition probabilities, we would expect 
the average weight out of a node to be 1. Broadie and Glasserman [66] 
point out an attractive feature of the less obvious condition (8.42), which 
we now explain. 
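
The property (8.42) is easy to check numerically. The following Python sketch computes the average-density weights (8.41) for an arbitrary set of nodes and verifies that the weights into each downstream node average to 1. The function names, the Gaussian transition density, and the node values are assumptions chosen for illustration.

```python
import math


def normal_pdf(y, mean, sd):
    """Density of N(mean, sd^2) at y."""
    z = (y - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def average_density_weights(current, successors, density):
    """Weights (8.41): W[j][k] = f(x_j, y_k) / ((1/b) * sum_l f(x_l, y_k))."""
    b = len(current)
    weights = [[0.0] * len(successors) for _ in range(b)]
    for k, y in enumerate(successors):
        column = [density(x, y) for x in current]  # f(x_l, y_k) for every l
        average = sum(column) / b                   # average density (8.40) at y_k
        for j in range(b):
            weights[j][k] = column[j] / average
    return weights


current = [-1.0, 0.0, 0.5, 2.0]        # hypothetical nodes at step i
successors = [-0.3, 0.4, 1.1, 1.7]     # hypothetical nodes at step i+1
W = average_density_weights(current, successors, lambda x, y: normal_pdf(y, x, 1.0))
column_means = [sum(W[j][k] for j in range(4)) / 4 for k in range(4)]
print(column_means)
```

The column averages equal 1 by construction for any node positions, since the denominator in (8.41) is exactly the average of the numerators over $j$. The row averages, by contrast, are generally not 1.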


Implications of the Weight Sum 


Consider the pricing of a European option in a mesh. Suppose the option 
has (discounted) payoff $h_m$ at $t_m$, and to price it in the mesh we use the 
recursion (8.27), but always taking the second term inside the max because 
early exercise is not permitted. The resulting estimate $\hat V_0$ at the root node 
can be written as 

$$\frac{1}{b^m}\sum_{j_1,\dots,j_m} h_m(X_{mj_m}) \prod_{i=2}^{m} W^{i-1}_{j_{i-1}j_i},$$

the sum ranging over all $j_i = 1,\dots,b$, $i = 1,\dots,m$. In other words, $\hat V_0$ is the 
average over all $b^m$ paths through the mesh of the payoff per path multiplied 
by the product of weights along the arcs of the path. By grouping paths that 
terminate at the same node, we can write this as 

$$\frac{1}{b}\sum_{j_m=1}^{b} h_m(X_{mj_m}) \left( \frac{1}{b^{m-1}}\sum_{j_1,\dots,j_{m-1}} \prod_{i=2}^{m} W^{i-1}_{j_{i-1}j_i} \right).$$

If the weights are defined by likelihood ratios, then the factor in parentheses 
has expected value 1; this can be verified by induction using the fact that a 
likelihood ratio always has expected value 1. Rewriting this factor as 

$$\frac{1}{b}\sum_{j_{m-1}=1}^{b} W^{m-1}_{j_{m-1}j_m}\ \frac{1}{b}\sum_{j_{m-2}=1}^{b} W^{m-2}_{j_{m-2}j_{m-1}} \cdots \frac{1}{b}\sum_{j_1=1}^{b} W^{1}_{j_1j_2}$$

reveals that it is identically 1 if (8.42) holds. Thus, when (8.42) holds, the 
mesh estimate of the European option price is simply 

$$\frac{1}{b}\sum_{j_m=1}^{b} h_m(X_{mj_m}),$$

the average of the terminal payoffs. 

This simplification is important because multiplying weights along steps 
of a path through the mesh can produce exponentially growing variance. We 
encountered this phenomenon in our discussion of importance sampling over 
long time horizons; see the discussion surrounding (4.82). The property in 
(8.42) thus replaces factors that have exponentially growing variance with the 
constant 1. Of course, there is no reason to use a mesh to price a European 
option, so this should be taken as indirect evidence that (8.42) is beneficial in 
the more interesting case of pricing American options. 
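
The collapse of the European mesh estimate to the average terminal payoff can be confirmed numerically. The Python sketch below builds a small mesh by the independent-path construction, applies the recursion (8.27) without the max using the weights (8.41), and compares the result with the plain average of terminal payoffs. The Gaussian random walk, the payoff $\max(x, 0)$, and all names are assumptions made for this illustration.

```python
import math
import random


def normal_pdf(y, mean, sd):
    """Density of N(mean, sd^2) at y."""
    z = (y - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def european_mesh_check(b=30, m=4, seed=1):
    """Return (mesh estimate, average terminal payoff) for a toy European option."""
    rng = random.Random(seed)
    # Independent-path construction: b Gaussian random-walk paths from X_0 = 0.
    paths = []
    for _ in range(b):
        x, path = 0.0, []
        for _ in range(m):
            x += rng.gauss(0.0, 1.0)
            path.append(x)
        paths.append(path)
    nodes = [[paths[j][i] for j in range(b)] for i in range(m)]  # nodes[i-1][j] = X_{ij}

    payoff = lambda x: max(x, 0.0)  # hypothetical terminal payoff h_m
    values = [payoff(x) for x in nodes[m - 1]]
    for i in range(m - 1, 0, -1):   # steps i = m-1, ..., 1
        cur, nxt = nodes[i - 1], nodes[i]
        dens = [[normal_pdf(y, x, 1.0) for y in nxt] for x in cur]
        avg = [sum(dens[l][k] for l in range(b)) / b for k in range(b)]
        # Recursion (8.27) without the max, using the weights (8.41).
        values = [sum(dens[j][k] / avg[k] * values[k] for k in range(b)) / b
                  for j in range(b)]
    mesh_estimate = sum(values) / b  # root value (8.28)
    terminal_average = sum(payoff(x) for x in nodes[m - 1]) / b
    return mesh_estimate, terminal_average


mesh_estimate, terminal_average = european_mesh_check()
print(mesh_estimate, terminal_average)
```

The two numbers agree up to floating-point rounding for any seed, mesh size, or number of steps, because every product of weights along paths into a terminal node averages to exactly 1 under (8.42).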

We arrived at the weights (8.41) through a construction that generates 
nodes at time $i+1$ by drawing $b$ independent samples from the average density 
(8.40), given $\mathbf{X}_i$. Now consider applying stratified sampling to this construction 
using as stratification variable the index $j$ of the node $X_{ij}$ selected 
at time $i$. This gives us $b$ equiprobable strata, so in a sample of size $b$ we 
draw exactly one value from each stratum. But this simply means that we 
draw one successor from each density $f_{i+1}(X_{ij},\cdot)$, $j = 1,\dots,b$, which is exactly 
what the independent-path construction does. In short, we may use the 
independent-path construction (which carries out the stratification implicitly) 
and then apply the weights (8.41). This is what Broadie and Glasserman [66] 
do in their numerical tests. Boyle, Kolkiewicz, and Tan [55] also implement 
this approach and combine it with low discrepancy sequences. 


Weights for the Low Estimator 


As explained in Section 8.5.1, a stochastic mesh defines an exercise policy 
throughout the state space (and not just at the nodes of the mesh) once the 
weights are extended to all points. This is the key to the low estimator defined 
through (8.32)—(8.33) and also to the interleaving estimator in (8.35). Both 
rely on the ability to estimate a continuation value $\hat C_i(x)$ at an arbitrary state 
$x$ and time $i$. 

As in Section 8.5.1, we use $W^i_k(x)$ to denote the weight on a hypothetical 
arc from state $x$ at time $i$ to mesh node $X_{i+1,k}$. In all of the likelihood ratio 
weights (8.38)-(8.41) discussed in this section, the current node $X_{ij}$ appears 
as an explicit argument of a function (a transition density). The obvious way 
to extend (8.38)-(8.41) to arbitrary points $x$ is thus to replace $X_{ij}$ with $x$. 
This is the method used in Broadie and Glasserman [66]. To avoid computing 
additional values of transition densities, one might alternatively use 
interpolation. 

Example 8.5.1 American geometric average option on seven assets. We consider 
a call option on the geometric average of multiple assets modeled by 
geometric Brownian motion. This provides a convenient setting for testing 
algorithms in high dimensions because the problem can be reduced to a single 
dimension in order to compute an accurate price for comparison. We used this 
idea in Section 5.5.1 to test quasi-Monte Carlo methods; see equation (5.33) 
and the surrounding discussion. 

We consider a specific example from Broadie and Glasserman [66]. There 
are seven uncorrelated underlying assets. Each is modeled as GBM($r - \delta$, $\sigma^2$) 
with interest rate $r = 3\%$, dividend yield $\delta = 5\%$, and volatility $\sigma = 0.40$. The 
assets have an initial price of 100 and the option’s strike price is also 100. The 
option expires in 1 year and can be exercised at intervals of 0.1 years starting 
at 0 and ending at 1. A binomial lattice (applied to the one-dimensional 
geometric average) yields a price of 3.27. The corresponding European option 
has a price of 2.42, so the early-exercise feature has significant value in this 
example. 

In pricing the option in the stochastic mesh, we treat it as a seven- 
dimensional problem and do not use the reduction to a single dimension. We 
use the weights in (8.41). Because the seven underlying assets evolve indepen- 
dently of each other, the transition density from one node to another factors 
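as the product of one-dimensional densities. More explicitly, the transition 
density from one generic node $x = (x_1,\dots,x_7)$ to another $y = (y_1,\dots,y_7)$ 
over a time interval of length $\Delta t$ is 

$$f(x,y) = \prod_{i=1}^{7} \frac{1}{\sigma\sqrt{\Delta t}\,y_i}\, \phi\!\left( \frac{\log(y_i/x_i) - (r - \delta - \frac{1}{2}\sigma^2)\Delta t}{\sigma\sqrt{\Delta t}} \right),$$

with $\phi$ the standard normal density. This assumes that each $x_i$ and $y_i$ records 
the level $S_i$ of the $i$th underlying asset. If instead we record $\log S_i$ in $x_i$ and 
$y_i$, then the transition density simplifies to 

$$f(x,y) = \prod_{i=1}^{7} \frac{1}{\sigma\sqrt{\Delta t}}\, \phi\!\left( \frac{y_i - x_i - (r - \delta - \frac{1}{2}\sigma^2)\Delta t}{\sigma\sqrt{\Delta t}} \right).$$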
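
The two forms of the transition density can be coded directly. The Python sketch below is illustrative only: the function names are hypothetical, all assets are assumed to share the same $r$, $\delta$, and $\sigma$ as in the example, and the product is formed naively, although with many assets or small time steps one would likely work with log-densities to avoid underflow.

```python
import math


def log_gbm_density(x, y, r, delta, sigma, dt):
    """Transition density between log-price vectors x and y (independent GBM assets)."""
    sd = sigma * math.sqrt(dt)
    drift = (r - delta - 0.5 * sigma ** 2) * dt
    density = 1.0
    for xi, yi in zip(x, y):
        z = (yi - xi - drift) / sd
        density *= math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))
    return density


def level_gbm_density(x, y, r, delta, sigma, dt):
    """Same transition expressed in price levels; adds the factor 1/y_i per asset."""
    logs_x = [math.log(v) for v in x]
    logs_y = [math.log(v) for v in y]
    jacobian = 1.0
    for yi in y:
        jacobian /= yi
    return log_gbm_density(logs_x, logs_y, r, delta, sigma, dt) * jacobian


# Parameters of the example: r = 3%, delta = 5%, sigma = 0.40, dt = 0.1.
params = dict(r=0.03, delta=0.05, sigma=0.40, dt=0.1)
d_log = log_gbm_density([0.0] * 7, [0.0] * 7, **params)
d_level = level_gbm_density([100.0] * 7, [100.0] * 7, **params)
print(d_log, d_level)
```

Because the mesh weights (8.41) are ratios of such densities evaluated at the same downstream node, the factors $1/y_i$ cancel, so either coordinate system yields the same weights.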


Figure 8.8 displays numerical results from Broadie and Glasserman [66] for 
this example. Computational effort increases as we move from left to right in 
the figure: the label (50,500) indicates a mesh constructed from b = 50 paths 
through which 500 additional paths are simulated to compute a low estimator, 
and the other labels should be similarly interpreted. The solid lines in the 
figure show the high and low estimators; the dashed lines show the upper and 
lower confidence limits (based on a nominal coverage of 90%) using standard 
errors estimated from 25 replications at each mesh size. The dotted horizontal 
line shows the true price. 

[Figure 8.8: plot of estimates against computational effort, with horizontal-axis 
labels (50,500), (100,1000), (200,2000), (400,4000), (800,8000), (1600,16000), 
(3200,32000).] 

Fig. 8.8. Convergence of high and low mesh estimators for an American geometric 
average option on seven underlying assets. 

The bias of the low estimator is much smaller than that of the high estimator 
in this example. Recall that the value computed by the low estimator 
reflects the exercise policy determined by the mesh. This means that even at 
relatively small values of $b$ (100, for example), the mesh does a reasonable 
job of identifying an effective exercise policy, even though the high estimator 
shows a large bias. The bias in the high estimator results from the backward 
induction procedure, which overestimates the value implicit in the mesh 
stopping rule. □ 


Computational Costs and Limitations 


Estimating a continuation value at a single node in the mesh requires calculating 
$b$ weights and then a weighted average and is thus an $O(b)$ operation. 
Each step in the backward induction procedure (8.27) requires estimating 
continuation values at $b$ nodes and is therefore an $O(b^2)$ operation. Applying 
the mesh to an $m$-step problem requires $O(mb^2)$ time. The implicit constant 
in this computing-time magnitude depends on the time required to generate 
a single transition of the underlying Markov chain (which in turn depends 
on the problem dimension d) and on the time to evaluate each weight. Each 
replication of the low estimator requires calculation of an additional weighted 
average at each step along a path and thus requires O(mb) computing time. It 
is therefore practical to run many more low-estimator paths than the number 
of mesh paths b; but reducing the bias in the low estimator requires increasing 
b. 

Based on numerical experiments, Broadie and Glasserman [66] tentatively 
suggest that the root mean square error of the mesh estimator is $O(b^{-1/2})$ 
in problems for which exact simulation of $X_1,\dots,X_m$ is possible. This is the 
same convergence rate that would be obtained in estimating the price of a 
European option from $b$ independent replications. But because the computational 
requirements of the mesh are quadratic in $b$, the convergence rate 
is a rather slow $O(s^{-1/4})$ when measured in units of computing time $s$. In 
contrast, European option pricing through independent replications retains 
square-root convergence in $s$ as well as $b$. In Section 8.6.2 we will see that, 
using regression, the time required for each step in the backward induction is 
proportional to $b$ rather than $b^2$, resulting in faster overall run times, usually 
at the cost of some approximation error from the choice of regressors. 
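
The passage from the error rate in $b$ to the error rate in computing time $s$ is a one-line calculation. Assuming, as above, that the work is proportional to $b^2$ for a fixed number of steps $m$, and taking the tentative $O(b^{-1/2})$ error rate as given:

$$s \propto b^2 \;\Longrightarrow\; b \propto s^{1/2} \;\Longrightarrow\; \text{RMSE} = O(b^{-1/2}) = O(s^{-1/4}).$$

Under this rate, halving the error requires roughly $2^4 = 16$ times the computing time. If instead the work per step were proportional to $b$, the same calculation would give $s \propto b$ and hence the usual $O(s^{-1/2})$ rate, for which halving the error costs a factor of 4.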

Broadie and Yamamoto [70] use a fast Gauss transform to accelerate backward 
induction calculations. This method reduces the computational effort for 
each step in the mesh from $O(b^2)$ to $O(b)$, but entails an approximation in 
the evaluation of the weights. Broadie and Yamamoto [70] find experimentally 
that the method works well in up to three dimensions. 

An important feature of likelihood ratio weights is that they do not depend 
on the payoff functions h;. Once a mesh is constructed and all its weights com- 
puted, it can be used to price many different American options. This requires 
storing all weights, but significantly reduces computing times compared to 
generating a new mesh. 

Based on numerical tests, Broadie and Glasserman [66] emphasize the 
importance of using control variates with the stochastic mesh. They use both 
inner and outer controls: inner controls apply at each node in the mesh, outer 
controls apply across independent replications of the mesh. Candidate control 
variates are moments of the underlying assets and the prices of European 
options, when available. 

The main limitation on applying the mesh with likelihood ratio weights 
is the need for a transition density. Transition densities for the underlying 
Markov chain may be unknown or may fail to exist. We encountered similar 
issues in estimating sensitivities in Section 7.3; see especially Section 7.3.2. 

Recall that the state $X_i$ of the underlying Markov chain records information 
at the $i$th exercise date $t_i$. The intervals $t_{i+1} - t_i$ separating exercise 
dates could be large, in which case we may need to simulate $X_{i+1}$ from $X_i$ 
through a series of smaller steps. Even if we know the transition density over 
a single step, we may not be able to evaluate the transition density from $t_i$ to 
$t_{i+1}$. We faced a similar problem in the setting of the LIBOR market model in 
Section 7.3.4. In such settings, it may be necessary to use a multivariate normal 
or lognormal approximation to the transition density in order to compute 
(approximate) likelihood ratios. 

As discussed in Section 7.3.2, transition densities often fail to exist in 
singular models, meaning models in which the dimension of the state vector 
exceeds the dimension of the driving Brownian motion. This is often the case 
in interest rate models. It can also arise through the introduction of 
supplementary variables in the state vector $X_i$. Consider, for example, the 
pricing of an American option on the average of a single underlying asset $S$. 
One might take as state $X_i$ the pair $(S(t_i), \bar S(t_i))$, in which 
$S(t_i)$ records the current level of the underlying asset and $\bar S(t_i)$ 
records the average over $t_1,\ldots,t_i$. This formulation eliminates 
path-dependence through the augmented state but results in a singular model: 
given $(S(t_i), \bar S(t_i))$ and $S(t_{i+1})$, the value of $\bar S(t_{i+1})$ 
is completely determined, so the Markov chain $X_i$ does not admit a 
transition density on $\mathbb{R}^2$. Weights based on likelihood ratios are 
infeasible in such settings. 
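To see the singularity concretely, suppose (as an assumption made only for this illustration) that the average is the equally weighted arithmetic mean over the exercise dates. The update below shows that the second state component at $t_{i+1}$ is a deterministic function of the current state and the new asset level, so the two-dimensional transition law is concentrated on a curve:

```latex
\[
\bar S(t_i) = \frac{1}{i}\sum_{l=1}^{i} S(t_l)
\quad\Longrightarrow\quad
\bar S(t_{i+1}) = \frac{i\,\bar S(t_i) + S(t_{i+1})}{i+1}.
\]
```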


Constrained Weights 


To address the problem of unknown or nonexistent transition densities, 
Broadie et al. [68] propose a method that selects weights through a constrained 
optimization problem. This method relies on the availability of known condi- 
tional expectations, which are used to constrain the weights. 

Suppose, then, that for some $\mathbb{R}^M$-valued function $G$ on the state 
space of the underlying Markov chain, the conditional expectation 

\[ g(x) = E[G(X_{i+1}) \mid X_i = x] \]

is a known function of the state $x$. For example, moments of the underlying 
assets and simple European options often provide candidate functions. Fix a 
node $X_{ij}$ and consider weights $W^i_{jk}$, $k = 1,\ldots,b$, satisfying 

\[ \frac{1}{b}\sum_{k=1}^{b} W^i_{jk}\, G(X_{i+1,k}) = g(X_{ij}); \tag{8.44} \]

these are weights that correctly "price" the payoff $G$, to be received at 
$i+1$, from the perspective of node $X_{ij}$. If $V_{i+1}$ is well-approximated 
by a linear combination of the components of $G$, then such weights should 
provide a good approximation to the continuation value at $X_{ij}$ when used 
in the basic mesh recursion (8.27). 

Taking one of the components of $G$ to be a constant imposes the constraint 

\[ \frac{1}{b}\sum_{k=1}^{b} W^i_{jk} = 1. \tag{8.45} \]


We noted previously that the alternative condition (8.42), which constrains 
the sum of the weights into a node rather than out of a node, has an 
appealing feature. But that condition links the choice of all $b^2$ weights 
from every node at step $i$ to every node at step $i+1$, whereas (8.44) and 
(8.45) apply separately to the $b$ weights used for each node $X_{ij}$. 

The number $M$ of easily computed conditional expectations is likely to be 
much smaller than the number of nodes $b$, in which case the constraints 
(8.44) do not completely determine the weights. From all feasible weights, 
Broadie et al. [68] select those minimizing an objective of the form 

\[ \sum_{k=1}^{b} H(W^i_{jk}) \]

for some convex function $H$. The specific cases they consider are a quadratic 
objective $H(x) = x^2/2$ and the entropy objective $H(x) = x \log x$. The 
quadratic leads to a simpler optimization problem but could produce negative 
weights, whereas the entropy objective will provide a nonnegative feasible 
solution if one exists. Broadie et al. [68] report numerical results showing 
that the two objectives result in similar price estimates and that these 
estimates are close to those produced using likelihood ratio weights in cases 
in which a transition density is available. 
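For the quadratic objective, the constrained problem has a closed-form solution through Lagrange multipliers: the weights are $W_k = \lambda^\top G(X_{i+1,k})$ with $\lambda$ solving $\bigl(\tfrac{1}{b}\sum_k G_k G_k^\top\bigr)\lambda = g(X_{ij})$. The sketch below illustrates this under assumptions that are not taken from Broadie et al. [68]: a Brownian step from $t = 1$ to $t = 2$, the hypothetical choice $G(y) = (1, y, y^2)$, and helper names (`solve`, `quadratic_weights`) invented for the illustration.

```python
import random

def solve(a, rhs):
    """Solve a x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(a[r]) + [rhs[r]] for r in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(c + 1, n):
            f = aug[r][c] / aug[c][c]
            for j in range(c, n + 1):
                aug[r][j] -= f * aug[c][j]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(aug[r][j] * x[j] for j in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x

def quadratic_weights(G_next, g_node):
    """Minimize sum_k W_k^2 / 2 subject to (1/b) sum_k W_k G_k = g_node."""
    b, M = len(G_next), len(g_node)
    # A = (1/b) sum_k G_k G_k^T; stationarity gives W_k = lambda . G_k
    A = [[sum(Gk[q] * Gk[r] for Gk in G_next) / b for r in range(M)]
         for q in range(M)]
    lam = solve(A, g_node)
    return [sum(l * v for l, v in zip(lam, Gk)) for Gk in G_next]

random.seed(0)
b = 200
x_node = 0.5  # hypothetical node X_ij at time 1
# Successor nodes at time 2 drawn from the N(0, 2) marginal (toy assumption).
X_next = [random.gauss(0.0, 2.0 ** 0.5) for _ in range(b)]
G_next = [[1.0, y, y * y] for y in X_next]
# Known conditional expectations of (1, X, X^2) for a unit Brownian step.
g_node = [1.0, x_node, x_node ** 2 + 1.0]
W = quadratic_weights(G_next, g_node)
```

The first component of `G` is the constant, so these weights also satisfy (8.45). Nothing in the construction prevents individual weights from being negative, which is the drawback of the quadratic objective noted in the text.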

This approach to selecting mesh weights is closely related to weighted 
Monte Carlo as presented in Section 4.5.2 and further analyzed in Glasserman 
and Yu [147]. As shown in Glasserman and Yu [148], it is also closely 
related to the regression-based methods we discuss next. A distinction 
between the methods is that the constrained optimization problem produces 
weights $W^i_{jk}$ that depend on both $X_{ij}$ and $X_{i+1,k}$, whereas 
weights defined through regression (in the next section) depend on $X_{ij}$ 
and $X_{ik}$. 


8.6 Regression-Based Methods and Weights 


Several authors — especially Carriére [78], Longstaff and Schwartz [241], and 
Tsitsiklis and Van Roy [350, 351] — have proposed the use of regression to 
estimate continuation values from simulated paths and thus to price American 
options by simulation. Each continuation value $C_i(x)$ in (8.11) is the 
regression of the option value $V_{i+1}(X_{i+1})$ on the current state $x$, 
and this suggests 
an estimation procedure: approximate C; by a linear combination of known 
functions of the current state and use regression (typically least-squares) to es- 
timate the best coefficients in this approximation. This approach is relatively 
fast and broadly applicable; its accuracy depends on the choice of functions 
used in the regression. The flexibility to choose these functions provides a 
mechanism for exploiting knowledge or intuition about the pricing problem. 

Though not originally presented this way, regression-based methods fit 
well within the stochastic mesh framework of Section 8.5: they start from the 
independent-path construction illustrated in Figure 8.7, and we will see that 
the use of regression at each step corresponds to an implicit choice of weights 
for the mesh. (We encountered a related observation in (4.19)—(4.20) where 
we observed that using control variates for variance reduction is equivalent to 
a particular way of assigning weights to replications.) This section therefore 
uses notation and ideas from Section 8.5. 


8.6.1 Approximate Continuation Values 


Regression-based methods posit an expression for the continuation value of 
the form 

\[ E[V_{i+1}(X_{i+1}) \mid X_i = x] = \sum_{r=1}^{M} \beta_{ir}\,\psi_r(x), \tag{8.46} \]

for some basis functions $\psi_r : \mathbb{R}^d \to \mathbb{R}$ and constants 
$\beta_{ir}$, $r = 1,\ldots,M$. We discussed this as an approximation strategy 
in Section 8.2. Using the notation $C_i$ for the continuation value at time 
$i$, we may equivalently write (8.46) as 

\[ C_i(x) = \beta_i^\top \psi(x), \tag{8.47} \]

with 

\[ \beta_i^\top = (\beta_{i1},\ldots,\beta_{iM}), \qquad \psi(x) = (\psi_1(x),\ldots,\psi_M(x))^\top. \]

One could use different basis functions at different exercise dates, but to 
simplify notation we suppress any dependence of $\psi$ on $i$. 

Assuming a relation of the form (8.46) holds, the vector $\beta_i$ is given by 

\[ \beta_i = \bigl(E[\psi(X_i)\psi(X_i)^\top]\bigr)^{-1} E[\psi(X_i) V_{i+1}(X_{i+1})] \equiv B_\psi^{-1} B_{\psi V}. \tag{8.48} \]

Here, $B_\psi$ is the indicated $M \times M$ matrix (assumed nonsingular) and 
$B_{\psi V}$ the indicated vector of length $M$. The variables $(X_i, X_{i+1})$ 
inside the expectations have the joint distribution of the state of the 
underlying Markov chain at dates $i$ and $i+1$. 
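The step from (8.46) to (8.48) is the usual normal-equations argument. As a brief sketch in the notation above: multiply the assumed linear representation by $\psi(X_i)$, take expectations, and use the tower property of conditional expectation.

```latex
\begin{align*}
E\bigl[\psi(X_i)\,V_{i+1}(X_{i+1})\bigr]
  &= E\bigl[\psi(X_i)\,E[V_{i+1}(X_{i+1}) \mid X_i]\bigr] \\
  &= E\bigl[\psi(X_i)\,\psi(X_i)^\top\bigr]\,\beta_i ,
\end{align*}
% so that, when the M x M matrix on the right is nonsingular,
\[
\beta_i = \bigl(E[\psi(X_i)\psi(X_i)^\top]\bigr)^{-1}
          E\bigl[\psi(X_i)\,V_{i+1}(X_{i+1})\bigr].
\]
```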

The coefficients $\beta_{ir}$ could be estimated from observations of pairs 
$(X_{ij}, V_{i+1}(X_{i+1,j}))$, $j = 1,\ldots,b$, each consisting of the state 
at time $i$ and the corresponding option value at time $i+1$. Consider, in 
particular, independent paths $(X_{1j},\ldots,X_{mj})$, $j = 1,\ldots,b$, and 
suppose for a moment that the values $V_{i+1}(X_{i+1,j})$ are known. The 
least-squares estimate of $\beta_i$ is then given by 

\[ \hat\beta_i = \hat B_\psi^{-1} \hat B_{\psi V}, \]

where $\hat B_\psi$ and $\hat B_{\psi V}$ are the sample counterparts of 
$B_\psi$ and $B_{\psi V}$. More explicitly, $\hat B_\psi$ is the $M \times M$ 
matrix with $qr$ entry 

\[ \frac{1}{b}\sum_{j=1}^{b} \psi_q(X_{ij})\,\psi_r(X_{ij}) \]

and $\hat B_{\psi V}$ is the $M$-vector with $r$th entry 

\[ \frac{1}{b}\sum_{k=1}^{b} \psi_r(X_{ik})\,V_{i+1}(X_{i+1,k}). \tag{8.49} \]

All of these quantities can be calculated from function values at pairs of 
consecutive nodes $(X_{ij}, X_{i+1,j})$, $j = 1,\ldots,b$. In practice, 
$V_{i+1}$ is unknown and must be replaced by estimated values $\hat V_{i+1}$ 
at downstream nodes. The estimate $\hat\beta_i$ then defines an estimate 

\[ \hat C_i(x) = \hat\beta_i^\top \psi(x) \tag{8.50} \]

of the continuation value at an arbitrary point $x$ in the state space 
$\mathbb{R}^d$. This in turn defines a procedure for estimating the option 
value: 

Regression-Based Pricing Algorithm 

(i) Simulate $b$ independent paths $\{X_{1j},\ldots,X_{mj}\}$, 
$j = 1,\ldots,b$, of the Markov chain. 

(ii) At terminal nodes, set $\hat V_{mj} = h_m(X_{mj})$, $j = 1,\ldots,b$. 

(iii) Apply backward induction: for $i = m-1,\ldots,1$, 

o given estimated values $\hat V_{i+1,j}$, $j = 1,\ldots,b$, use regression 
as above to calculate $\hat\beta_i = \hat B_\psi^{-1}\hat B_{\psi V}$; 
o set 
\[ \hat V_{ij} = \max\{h_i(X_{ij}), \hat C_i(X_{ij})\}, \quad j = 1,\ldots,b, \tag{8.51} \]
with $\hat C_i$ as in (8.50). 

(iv) Set $\hat V_0 = (\hat V_{11} + \cdots + \hat V_{1b})/b$. 


This is the approach introduced by Tsitsiklis and Van Roy [350, 351]. They 
show in [351] that if the representation (8.46) holds at all 
$i = 1,\ldots,m-1$, then the estimate $\hat V_0$ converges to the true value 
$V_0(X_0)$ as $b \to \infty$. Longstaff and Schwartz [241] combine 
continuation values estimated using (8.50) with their interleaving estimator 
discussed in Section 8.5.1. In other words, they replace (8.51) with 

\[ \hat V_{ij} = \begin{cases} h_i(X_{ij}), & h_i(X_{ij}) \ge \hat C_i(X_{ij}); \\ \hat V_{i+1,j}, & h_i(X_{ij}) < \hat C_i(X_{ij}). \end{cases} \tag{8.52} \]

They recommend omitting nodes $X_{ij}$ with $h_i(X_{ij}) = 0$ in estimating 
$\hat\beta_i$. Clément, Lamberton, and Protter [86] prove convergence of the 
Longstaff-Schwartz procedure as $b \to \infty$. The limit attained coincides 
with the true price $V_0(X_0)$ if the representation (8.46) holds exactly; 
otherwise, the limit coincides with the value under a suboptimal exercise 
policy and thus underestimates the true price. In practice, (8.52) therefore 
produces low-biased estimates. 
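Steps (i)-(iv), with either update (8.51) or (8.52), can be sketched in a few lines. The following is a minimal illustration rather than the implementation of any of the cited papers. It assumes a single-asset American put under geometric Brownian motion with hypothetical parameters, a hypothetical quadratic basis in $S/K$, payoffs already discounted to time 0 (so no explicit discounting appears in the regression), and it does not apply the recommendation to omit out-of-the-money nodes.

```python
import math
import random

def solve(a, rhs):
    """Solve a x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(a[r]) + [rhs[r]] for r in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(c + 1, n):
            f = aug[r][c] / aug[c][c]
            for j in range(c, n + 1):
                aug[r][j] -= f * aug[c][j]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(aug[r][j] * x[j] for j in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x

def fit(psi_rows, y):
    """Least-squares coefficients from the normal equations."""
    M = len(psi_rows[0])
    A = [[sum(p[q] * p[r] for p in psi_rows) for r in range(M)] for q in range(M)]
    c = [sum(p[q] * v for p, v in zip(psi_rows, y)) for q in range(M)]
    return solve(A, c)

def psi(s, K):
    """Hypothetical basis: 1, x, x^2 with x = s / K (scaled for conditioning)."""
    x = s / K
    return [1.0, x, x * x]

def price(b, m, longstaff_schwartz, seed=0, S0=100.0, K=100.0,
          r=0.05, sigma=0.2, T=1.0):
    rng = random.Random(seed)
    dt = T / m
    # (i) simulate b independent paths at exercise dates t_1, ..., t_m
    paths = []
    for _ in range(b):
        s, path = S0, []
        for _ in range(m):
            z = rng.gauss(0.0, 1.0)
            s *= math.exp((r - 0.5 * sigma ** 2) * dt + sigma * math.sqrt(dt) * z)
            path.append(s)
        paths.append(path)

    def h(i, s):
        """Put payoff at t_i, discounted to time 0."""
        return math.exp(-r * dt * i) * max(K - s, 0.0)

    # (ii) terminal values
    V = [h(m, p[m - 1]) for p in paths]
    # (iii) backward induction for i = m-1, ..., 1
    for i in range(m - 1, 0, -1):
        rows = [psi(p[i - 1], K) for p in paths]
        beta = fit(rows, V)
        for j, p in enumerate(paths):
            cont = sum(bq * pq for bq, pq in zip(beta, rows[j]))
            ex = h(i, p[i - 1])
            if longstaff_schwartz:
                V[j] = ex if ex >= cont else V[j]   # update (8.52)
            else:
                V[j] = max(ex, cont)                # update (8.51)
    # (iv) average over paths
    return sum(V) / b

estimate_851 = price(2000, 10, longstaff_schwartz=False)
estimate_852 = price(2000, 10, longstaff_schwartz=True)
```

The only difference between the two estimators is the single line inside the backward induction: (8.51) carries the fitted continuation value forward, whereas (8.52) carries the realized downstream value forward.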

The success of any regression-based approach clearly depends on the choice 
of basis functions. Polynomials (sometimes damped by functions vanishing at 
infinity) are a popular choice ([350, 241]). Through Taylor expansion, any 
sufficiently smooth value function can be approximated by polynomials. How- 
ever, the number of monomials of a given degree grows exponentially in the 
number of variables, so without further restrictions on the value function the 
number of basis functions required could grow quickly with the dimension of 
the underlying state vector $X_i$. Longstaff and Schwartz [241] use 5 to 20 basis 
functions in the examples they test. 

We have presented the regression-based pricing algorithm above in the 
discounted formulation (8.6)-(8.7) of the dynamic programming problem, as 
we have all methods in this chapter. This assumes that payoffs and value 
estimates are denominated in time-0 dollars. In practice, payoffs and value 
estimates at time $t_i$ are usually denominated in time-$t_i$ dollars, and this 
requires including explicit discounting in the algorithm. Consider, for 
example, the case of a constant continuously compounded interest rate $r$. In 
step (ii) of the algorithm, we would use the undiscounted payoff $\tilde h$ as 
in (8.4); in step (iii) we would regress 
$e^{-r(t_{i+1}-t_i)}\hat V_{i+1,j}$ (rather than $\hat V_{i+1,j}$) against 
$\psi(X_{ij})$; in (8.49) we would replace $V_{i+1}(X_{i+1,k})$ with 
$e^{-r(t_{i+1}-t_i)} V_{i+1}(X_{i+1,k})$. With these modifications, 
$\hat C_i(x)$ in (8.50) would be interpreted as the present value of 
continuing (denominated in time-$t_i$ dollars). 


Low Estimator 


The estimated vectors of coefficients $\hat\beta_i$ determine approximate 
continuation values $\hat C_i(x)$ for every step $i$ and state $x$, through 
(8.50). These in turn define an exercise policy, just as in (8.33), and thus a 
low estimator as in (8.34). This is the low estimator of the stochastic mesh 
applied to continuation values estimated through regression. The method of 
Longstaff and Schwartz [241] in (8.52) usually produces a low-biased estimator 
as well, though as explained in Section 8.5.1, their interleaving estimator 
mixes elements of high and low bias. 


Example 8.6.1 American max option. To illustrate regression-based pricing, 
we consider an American option on the maximum of underlying assets 
modeled by geometric Brownian motion. This example is used in Broadie and 
Glasserman [65, 66] and Andersen and Broadie [15] with up to five underlying 
assets; for simplicity, here we consider just two underlying assets, $S_1$ and 
$S_2$. Each is modeled as GBM($r - \delta$, $\sigma^2$) with interest rate 
$r = 5\%$, dividend yield $\delta = 10\%$, and volatility $\sigma = 0.20$. The 
two assets are independent of each other. The (undiscounted) payoff upon 
exercise at time $t$ is 

\[ h(S_1(t), S_2(t)) = (\max(S_1(t), S_2(t)) - K)^+. \]

We take $S_1(0) = S_2(0)$ and $K = 100$. The option expires in $T = 3$ years 
and can be exercised at nine equally spaced dates $t_i = i/3$, 
$i = 1,\ldots,9$. Valuing the option in a two-dimensional binomial lattice, as 
in Boyle, Evnine, and Gibbs [54], yields option prices of 13.90, 8.08, and 
21.34 for $S_i(0) = 100$, 90, and 110, respectively. These are useful for 
comparison. 

To price the option using regression and simulation, we need to choose 
basis functions. We compare various combinations of powers of the underlying 
assets, the immediate exercise value h, and related functions. For each case 
we apply the regression-based pricing algorithm (i)—(iv) with 4000 paths. The 
continuation values estimated by regression define an exercise policy so we 
then simulate a second set of 4000 paths that follow this exercise policy: this 
is the low estimator in (8.33)—(8.34) based on the regression estimate (8.47) of 
the continuation value. We also apply the method of Longstaff and Schwartz 
[241] in (8.52). We replicate all three of these estimators 100 times to calculate 
standard errors. 

Tables 8.1 and 8.2 display numerical results for various sets of basis 
functions and initial values of the underlying assets. The basis functions 
appear in Table 8.1. We treat $S_1$ and $S_2$ symmetrically in each case; the 
abbreviation $S_i^2$, for example, indicates that both $S_1^2$ and $S_2^2$ are 
included. We also include a constant in each regression. The results in 
Table 8.1 are for $S_i(0) = 100$; Table 8.2 is based on the same basis 
functions but with $S_i(0) = 90$ (left half) and $S_i(0) = 110$ (right half). 
The estimates have standard errors of approximately 0.02 to 0.03. 

The results in the tables show that the pure regression-based estimator 
can have significant high bias even with nine reasonable basis functions in a 
two-dimensional problem (as in the third row of each table). The choice of 
basis functions clearly affects the price estimate. In this example, including 
the interaction term $S_1 S_2$ and the exercise value $h(S_1, S_2)$ appears to 
be particularly important. (Andersen and Broadie [15] get nearly exact 
estimates for 
this example using twelve basis functions, including the value of a European 
max option.) 


Basis Functions                                                        Regression   Low     LSM 
1, S_i, S_i^2, S_i^3                                                      15.74     13.62   13.67 
1, S_i, S_i^2, S_i^3, S_1 S_2                                             15.24     13.65   13.68 
1, S_i, S_i^2, S_i^3, S_1 S_2, max(S_1, S_2)                              15.23     13.64   13.63 
1, S_i, S_i^2, S_i^3, S_1 S_2, S_1^2 S_2, S_1 S_2^2                       15.07     13.71   13.67 
1, S_i, S_i^2, S_i^3, S_1 S_2, S_1^2 S_2, S_1 S_2^2, h(S_1, S_2)          14.06     13.77   13.79 
1, S_i, S_i^2, S_1 S_2, h(S_1, S_2)                                       14.08     13.78   13.78 


Table 8.1. Price estimates for an American option on the maximum of two assets. 
The true price is 13.90. Each estimate has a standard error of approximately 0.025. 


Regression   Low   LSM   |   Regression   Low   LSM 


Table 8.2. Price estimates for out-of-the-money (left) and in-the-money (right) 
American option on the maximum of two assets. True prices are 8.08 and 21.34. 
Each estimate has a standard error of approximately 0.02-0.03. 


The low estimates appear to be less sensitive to the choice of basis 
functions. As the high bias in the regression estimate decreases, the low 
estimate generally increases; both properties result from a better fit to the 
continuation value. The ordinary low estimator (labeled "Low") and the 
Longstaff-Schwartz estimator (labeled "LSM") give nearly identical results. 
Longstaff and Schwartz [241] recommend including only in-the-money nodes 
$X_{ij}$ in the regression used to estimate the continuation value $C_i$; this 
alternative gives inferior results in this example and is therefore omitted 
from the tables. 

Though it is risky to extrapolate from limited numerical tests, this example 
suggests that using either of the low-biased estimators is preferable to relying 
on the pure regression-based estimator, and that neither of the low-biased 
estimators consistently outperforms the other. As expected, the choice of basis 
functions has a significant impact on the estimated prices. 

Figure 8.9 displays exercise regions at i = 4, the fourth of the nine ex- 
ercise opportunities. The dashed lines show the optimal exercise boundary 
computed from a binomial lattice: it is optimal to exercise the max option if 
the price of one — but not both — of the underlying assets is sufficiently high. 
(This and other properties of the exercise region are proved in Broadie and 
Detemple [63].) The shaded area shows the exercise region estimated through 
the last regression in Table 8.1. More specifically, the shaded area 
corresponds to points at which $h(S_1, S_2)$ is greater than or equal to the 
estimated continuation value. The regression estimate generally comes close to 
the optimal boundary but erroneously indicates that it is optimal to exercise 
in the lower-left corner where the regression estimate of the continuation 
value is negative. This can be corrected by replacing $\hat C_i(x)$ with 
$\max\{0, \hat C_i(x)\}$ and exercising only if $h(S_1, S_2)$ is strictly 
greater than the continuation value. The results in Table 8.1 use this 
correction. 


Fig. 8.9. Exercise region for American max option on two underlying assets. The 
shaded area is the exercise region determined by regression on seven basis 
functions; dashed lines show the optimal boundary. 


8.6.2 Regression and Mesh Weights 


We indicated at the beginning of this section that the regression-based al- 
gorithm (i)-(iv) corresponds to a stochastic mesh estimator with an implicit 
choice of mesh weights. We now make this explicit. 

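Using the regression representation (8.50) and then (8.49), we can write 
the estimated continuation value at node $X_{ij}$ as 

\[ \hat C_i(X_{ij}) = \psi(X_{ij})^\top \hat\beta_i = \psi(X_{ij})^\top \hat B_\psi^{-1} \hat B_{\psi V} = \frac{1}{b}\sum_{k=1}^{b} \bigl(\psi(X_{ij})^\top \hat B_\psi^{-1} \psi(X_{ik})\bigr)\, \hat V_{i+1,k}. \tag{8.53} \]

Thus, the estimated continuation value at node $X_{ij}$ is a weighted average 
of the estimated option values at step $i+1$, with weights 

\[ W^i_{jk} = \psi(X_{ij})^\top \hat B_\psi^{-1} \psi(X_{ik}). \tag{8.54} \]

In other words, (8.53) is a special case of the general mesh approximation 

\[ \hat C_i(X_{ij}) = \frac{1}{b}\sum_{k=1}^{b} W^i_{jk}\, \hat V_{i+1,k}. \tag{8.55} \]

This extends to arbitrary points $x$ in the state space, as in (8.32), if we 
simply replace $X_{ij}$ with $x$ in (8.54) and (8.55). 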
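The identity between the regression estimate (8.50) and the weighted average (8.55) is easy to check numerically. The sketch below is only an illustration under stated assumptions: one-dimensional standard normal nodes, a hypothetical basis $(1, x, x^2)$, and placeholder downstream values `V_next` that stand in for the estimated option values at step $i+1$.

```python
import random

def solve(a, rhs):
    """Solve a x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(a[r]) + [rhs[r]] for r in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(c + 1, n):
            f = aug[r][c] / aug[c][c]
            for j in range(c, n + 1):
                aug[r][j] -= f * aug[c][j]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(aug[r][j] * x[j] for j in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x

def dot(u, v):
    return sum(s * t for s, t in zip(u, v))

random.seed(1)
b, M = 50, 3
X = [random.gauss(0.0, 1.0) for _ in range(b)]            # nodes X_ik at step i
V_next = [max(x + random.gauss(0.0, 1.0), 0.0) for x in X]  # placeholder values at step i+1
P = [[1.0, x, x * x] for x in X]                          # psi(X_ik), constant included

# Sample matrices from (8.49)
B_hat = [[sum(p[q] * p[r] for p in P) / b for r in range(M)] for q in range(M)]
BV_hat = [sum(p[q] * v for p, v in zip(P, V_next)) / b for q in range(M)]
beta_hat = solve(B_hat, BV_hat)

# Weights (8.54): W_jk = psi(X_ij)^T B_hat^{-1} psi(X_ik)
U = [solve(B_hat, p) for p in P]
W = [[dot(P[j], U[k]) for k in range(b)] for j in range(b)]

C_regression = [dot(beta_hat, P[j]) for j in range(b)]    # (8.50)
C_mesh = [dot(W[j], V_next) / b for j in range(b)]        # (8.55)
```

Because the basis contains a constant, each row of `W` sums to `b`, and the matrix is symmetric; the two vectors of continuation values agree up to rounding.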

We made similar observations in our discussion of the link between control 
variates and weighted Monte Carlo in (4.19)-(4.20) and Example 4.5.6, and 
we can put the weights in (8.54) in the form appearing in (4.20). In the 
representation (8.46) of the value function, one would often take one of the 
basis functions to be a constant. To make this explicit, let 
$\psi_0 \equiv 1$ and add a term $\beta_{i0}\psi_0(x)$ in (8.46). Let $S_\psi$ 
denote the sample covariance matrix of the other basis functions: this is the 
$M \times M$ matrix with $qr$ entry 

\[ \frac{1}{b-1}\sum_{j=1}^{b} \bigl(\psi_q(X_{ij})\,\psi_r(X_{ij}) - \bar\psi_q \bar\psi_r\bigr), \]

where each $\bar\psi_r$ is the sample mean of $\psi_r$ values at time $i$, 

\[ \bar\psi_r = \frac{1}{b}\sum_{j=1}^{b} \psi_r(X_{ij}). \]

If $S_\psi$ is nonsingular, then some matrix algebra shows that the regression 
weights (8.54) can also be written as 

\[ W^i_{jk} = 1 + \frac{b}{b-1}\,(\psi(X_{ij}) - \bar\psi)^\top S_\psi^{-1} (\psi(X_{ik}) - \bar\psi), \tag{8.56} \]

where $\bar\psi$ is the $M$-vector of sample means $\bar\psi_r$, 
$r = 1,\ldots,M$. (Had we not removed the constant $\psi_0$, the sample 
covariance matrix would have no chance of being invertible.) This expression 
has the same form as (4.20). Like (8.54), it extends to an arbitrary point $x$ 
in the state space if we replace $X_{ij}$ with $x$; this substitution defines 
the extended weights $W^i_k(x)$ needed for the low estimator and interleaving 
estimator discussed in Section 8.5.1. 
We make the following observations regarding this formulation: 


o The regression-based weights (8.54) and (8.56) are symmetric in $j$ and $k$. 
They sum to $b$ if we either fix $j$ and sum over $k$ or fix $k$ and sum over 
$j$. This is relevant to the discussion following (8.42). 

o The regression-based weights satisfy conditions (M1) and (M2) of 
Section 8.5.1. They satisfy condition (M3) if the regression equation (8.46) 
is valid. 

o From (8.56) we see that $E[W^i_{jk} \mid X_{ij}]$ is nearly 1 for any 
$k \ne j$. In this respect the regression weights resemble likelihood ratios. 
But in contrast to likelihood ratios the regression weights can take negative 
values. 

o All of the likelihood ratio weights considered in Section 8.5.2 share the 
property that the weight $W^i_{jk}$ depends on both the origin and destination 
nodes $X_{ij}$ and $X_{i+1,k}$. In contrast, the weight assigned through 
regression depends on $X_{ij}$ and $X_{ik}$, but not on any values at step 
$i+1$. This results in weights that are less variable but also less tuned to 
the observed outcomes. It also means, curiously, that the weights implicit in 
regression are insensitive to the spacing $t_{i+1} - t_i$ between exercise 
dates. Using least-squares and (8.44) produces weights $W^i_{jk}$ that depend 
on $X_{ij}$ and $X_{i+1,k}$. 

o The regression weight $W^i_{jk}$ depends on a type of distance between 
$X_{ij}$ and $X_{ik}$, or rather between $\psi(X_{ij})$ and $\psi(X_{ik})$. 
Consider, for example, what happens when $\hat B_\psi$ is the identity matrix. 
If $\psi(X_{ij})$ and $\psi(X_{ik})$ are orthogonal vectors, the weight 
$W^i_{jk}$ is zero; the absolute value of the weight is greatest when these 
vectors are multiples of each other. Points $\psi(X_{ik})$ that are 
equidistant from $\psi(X_{ij})$ in the Euclidean sense can get very different 
weights $W^i_{jk}$, as illustrated (for $M = 2$ dimensions) in Figure 8.10. 

o As noted in the discussion surrounding (8.37), only likelihood ratio weights 
can correctly price all payoffs at time $i+1$ from the perspective of node 
$X_{ij}$. The regression weights correctly price payoffs whose price at time 
$i$ as a function of the current state is a linear combination of the basis 
functions $\psi_r$. In practice, the true continuation value $C_i$ is unlikely 
to be exactly a linear combination of the basis functions, so the regression 
procedure produces (an estimate of) the projection of $C_i$ onto the span of 
$\psi_1,\ldots,\psi_M$. 

o Calculating all estimated continuation values $\hat C_i(X_{ij})$, 
$j = 1,\ldots,b$, using the regression representation (8.50) requires a single 
$O(M^3)$ calculation to compute $\hat\beta_i$ and then $O(Mb)$ calculations to 
compute $\hat\beta_i^\top \psi(X_{ij})$ at all $b$ nodes. In contrast, making 
explicit use of the weights as in (8.55) requires $O(M^2)$ operations to 
calculate the weights and then $O(b^2)$ operations to estimate all $b$ 
continuation values. Because one would typically set $b$ much larger than $M$, 
this comparison favors taking advantage of the special structure of the 
weights, as in (8.50). If one prices multiple options using a single mesh, the 
coefficients $\hat\beta_i$ need to be recomputed for each option whereas the 
weights (8.56) do not. The computational effort required for (8.55) can be 
reduced to $O(M^3 + M^2 b)$ by rewriting (8.54) as 

\[ W^i_{jk} = \xi_j^\top \xi_k \]

with 

\[ \xi_k = A^\top \psi(X_{ik}), \qquad A A^\top = \hat B_\psi^{-1}. \]

Computing $A$ is an $O(M^3)$ operation. Computing $\xi_k$ given $A$ is 
$O(M^2)$ and repeating this for $k = 1,\ldots,b$ is $O(M^2 b)$. We can 
rearrange the calculation of the sum in (8.55) as 

\[ \sum_{k=1}^{b} W^i_{jk}\, \hat V_{i+1,k} = \sum_{k=1}^{b} \xi_j^\top \xi_k\, \hat V_{i+1,k} = \xi_j^\top \Bigl(\sum_{k=1}^{b} \xi_k\, \hat V_{i+1,k}\Bigr). \]

Calculating the term in parentheses is $O(Mb)$. This term is independent of 
$j$, so calculating its inner product with all $\xi_j$, $j = 1,\ldots,b$, is 
also $O(Mb)$. This is advantageous if several options are to be priced using 
the same mesh and basis functions because the matrix inversion is executed 
just once, rather than separately for each option. 


Fig. 8.10. The three points on the circle are the same distance from the center 
but are assigned different weights by node j: if $\hat B_\psi = I$, the ith 
node gets positive weight, the nth gets negative weight, and the kth gets zero 
weight. 
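The rearrangement described in the last observation above can be sketched as follows. This is an illustration only: it takes $A = L^{-\top}$, where $\hat B_\psi = L L^\top$ is a Cholesky factorization (one valid choice satisfying $A A^\top = \hat B_\psi^{-1}$, not necessarily the one the text has in mind), so that $\xi_k = L^{-1}\psi(X_{ik})$ is obtained by forward substitution. The nodes, basis, and payoffs are hypothetical.

```python
import math
import random

def cholesky(B):
    """Lower-triangular L with L L^T = B, for symmetric positive definite B."""
    n = len(B)
    L = [[0.0] * n for _ in range(n)]
    for r in range(n):
        for c in range(r + 1):
            s = B[r][c] - sum(L[r][t] * L[c][t] for t in range(c))
            L[r][c] = math.sqrt(s) if r == c else s / L[c][c]
    return L

def forward_solve(L, v):
    """Solve L x = v for lower-triangular L."""
    x = []
    for r in range(len(v)):
        x.append((v[r] - sum(L[r][t] * x[t] for t in range(r))) / L[r][r])
    return x

def dot(u, v):
    return sum(s * t for s, t in zip(u, v))

random.seed(2)
b, M = 200, 3
X = [random.gauss(0.0, 1.0) for _ in range(b)]        # nodes at step i
X_next = [x + random.gauss(0.0, 1.0) for x in X]      # successors at step i+1
P = [[1.0, x, x * x] for x in X]                      # hypothetical basis values
B_hat = [[sum(p[q] * p[r] for p in P) / b for r in range(M)] for q in range(M)]

L = cholesky(B_hat)                       # O(M^3), once per mesh step
xi = [forward_solve(L, p) for p in P]     # xi_k = A^T psi(X_ik); O(M^2 b) in total

def continuation_values(V_next):
    """All b continuation values via (8.55), in O(M b) once xi is available."""
    acc = [sum(x[q] * v for x, v in zip(xi, V_next)) for q in range(M)]
    return [dot(x, acc) / b for x in xi]

# Two different payoffs priced on the same mesh reuse L and xi unchanged.
V_call = [max(y, 0.0) for y in X_next]
V_put = [max(-y, 0.0) for y in X_next]
C_call = continuation_values(V_call)
C_put = continuation_values(V_put)
```

Only `continuation_values` depends on the payoff, which is the sense in which the factorization is shared across options priced on the same mesh and basis.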


Example 8.6.2 Weights for a Brownian mesh. To illustrate the idea of 
regression-based weights, we consider a simple example in which the 
underlying Markov chain $X_i$ is the state of a standard one-dimensional 
Brownian motion at dates $t_i = i$. We consider the estimation of a 
continuation value (or other conditional expectation) at $t_1 = 1$, at which 
time the Brownian motion has a standard normal distribution. For basis 
functions we use the constant 1 and powers $x^n$ for $n = 1,\ldots,M$. With 
$M = 4$ and $Z$ denoting a standard normal random variable, the matrix 
$B_\psi$ becomes 

\[ B_\psi = E\begin{pmatrix} 1 & Z & Z^2 & Z^3 & Z^4 \\ Z & Z^2 & Z^3 & Z^4 & Z^5 \\ Z^2 & Z^3 & Z^4 & Z^5 & Z^6 \\ Z^3 & Z^4 & Z^5 & Z^6 & Z^7 \\ Z^4 & Z^5 & Z^6 & Z^7 & Z^8 \end{pmatrix} = \begin{pmatrix} 1 & 0 & 1 & 0 & 3 \\ 0 & 1 & 0 & 3 & 0 \\ 1 & 0 & 3 & 0 & 15 \\ 0 & 3 & 0 & 15 & 0 \\ 3 & 0 & 15 & 0 & 105 \end{pmatrix}. \]

Its inverse is 

\[ B_\psi^{-1} = \begin{pmatrix} 15/8 & 0 & -5/4 & 0 & 1/8 \\ 0 & 5/2 & 0 & -1/2 & 0 \\ -5/4 & 0 & 2 & 0 & -1/4 \\ 0 & -1/2 & 0 & 1/6 & 0 \\ 1/8 & 0 & -1/4 & 0 & 1/24 \end{pmatrix}. \]

Consider two nodes $X_{1j} = x_j$ and $X_{1k} = x_k$; these correspond to two 
possible values of the Brownian motion at time 1. The corresponding weight 
$W_{jk} \equiv W^1_{jk}$ is given by 

\begin{align*}
W_{jk} &= \psi(x_j)^\top B_\psi^{-1} \psi(x_k) \\
&= \frac{15}{8} - \frac{5}{4}(x_j^2 + x_k^2) + \frac{5}{2} x_j x_k - \frac{1}{2}(x_j x_k^3 + x_j^3 x_k) + 2 x_j^2 x_k^2 \\
&\quad + \frac{1}{8}(x_j^4 + x_k^4) - \frac{1}{4}(x_j^2 x_k^4 + x_j^4 x_k^2) + \frac{1}{6} x_j^3 x_k^3 + \frac{1}{24} x_j^4 x_k^4.
\end{align*}

More generally, regressing on powers up to order $M$ makes each $W_{jk}$ an 
$M$-th order polynomial in $x_k$ for each value of $x_j$. 
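The moment matrix and its inverse in this example can be reproduced in exact rational arithmetic. The sketch below uses only the fact that $E[Z^n]$ is 0 for odd $n$ and $(n-1)!!$ for even $n$; the function names are illustrative and not from the text.

```python
from fractions import Fraction

def normal_moment(n):
    """E[Z^n] for standard normal Z: 0 for odd n, (n-1)!! for even n."""
    if n % 2:
        return Fraction(0)
    out = Fraction(1)
    for k in range(n - 1, 0, -2):
        out *= k
    return out

def invert(A):
    """Gauss-Jordan inverse in exact rational arithmetic."""
    n = len(A)
    aug = [list(A[r]) + [Fraction(int(r == c)) for c in range(n)] for r in range(n)]
    for c in range(n):
        p = next(r for r in range(c, n) if aug[r][c] != 0)
        aug[c], aug[p] = aug[p], aug[c]
        piv = aug[c][c]
        aug[c] = [v / piv for v in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]

M = 4
# B_psi has (q, r) entry E[Z^(q+r)] for q, r = 0, ..., M
B = [[normal_moment(q + r) for r in range(M + 1)] for q in range(M + 1)]
B_inv = invert(B)

def weight(xj, xk):
    """W_jk = psi(x_j)^T B_psi^{-1} psi(x_k) with psi(x) = (1, x, ..., x^M)."""
    pj = [Fraction(xj) ** q for q in range(M + 1)]
    pk = [Fraction(xk) ** r for r in range(M + 1)]
    return sum(pj[q] * B_inv[q][r] * pk[r]
               for q in range(M + 1) for r in range(M + 1))

# Coefficient of x_k^4 in W_jk when x_j = 1
leading_coefficient = sum(B_inv[q][M] for q in range(M + 1))
```

With $x_j = 1$ the coefficient of $x_k^4$ is negative, which is why the weight, viewed as a function of $x_k$, eventually decreases without bound in both directions.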

The solid line in Figure 8.11 plots $W_{jk}$ as a function of $x_k$ for 
$x_j = 1$. Taking $x_j = 1$ means we are computing a conditional expectation 
(or price) given that the Brownian motion is at 1 at time $t_1$. Regression 
computes this conditional expectation as a weighted average of downstream 
values that are successors of other nodes $x_k$ at time $t_1$. The solid line 
in the figure shows the weight assigned to the successor of $x_k$, as the 
value of $x_k$ varies along the horizontal axis. The weights oscillate and 
then drop to $-\infty$ as $x_k$ moves away from $x_j = 1$; these properties 
result from the polynomial form of the weights. 

The dashed line in Figure 8.11 shows the conditional expectation of 
likelihood ratio weights based on (8.36). A direct comparison between 
likelihood ratio weights and regression weights is not possible because (8.36) 
depends on $X_{ij}$ and $X_{i+1,k}$, whereas the corresponding regression 
weight depends on $X_{ij}$ and $X_{ik}$. By taking the conditional expectation 
of (8.36) given $X_{ik}$ and $X_{ij}$, we make it a function of these nodes. In 
more detail, we consider the step from $t_i = 1$ to $t_{i+1} = 2$. The 
transition density is normal, $f(x, y) = \phi(y - x)$ with $\phi$ the standard 
normal density. The marginal distribution at $t_2 = 2$ is $N(0, 2)$. The 
likelihood ratio (8.36) is then 

\[ W^i_{jk} = \sqrt{2}\, \exp\Bigl(\tfrac{1}{2} X_{ij}^2 - \tfrac{1}{4}(X_{i+1,k} - 2X_{ij})^2\Bigr). \tag{8.57} \]

Using the fact that $X_{i+1,k} - X_{ik}$ has a standard normal distribution, 
we can calculate the conditional expectation of $W^i_{jk}$ given $X_{ij}$ and 
$X_{ik}$ to get 

\[ E[W^i_{jk} \mid X_{ij} = x_j, X_{ik} = x_k] = \frac{2}{\sqrt{3}} \exp\bigl(-(x_k^2 - 4 x_j x_k + x_j^2)/6\bigr). \]

Viewed as a function of $x_k$ with $x_j$ held fixed, this is a scaled normal 
density centered at $2x_j$. It is plotted in Figure 8.11 with $x_j = 1$. The 
unconditional weights in (8.57) also traverse a scaled normal density, 
centered at $2X_{ij}$, if we fix $X_{ij}$ and let $X_{i+1,k}$ vary. The dashed 
line in the figure is more consistent with intuition for how downstream nodes 
should be weighted than is the solid line. □ 
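For completeness, here is one way to carry out the conditional expectation in this example. It relies on the standard Gaussian identity $E[e^{-c(Z+a)^2}] = (1+2c)^{-1/2}\exp(-c a^2/(1+2c))$ for $c > -1/2$ and $Z \sim N(0,1)$, applied with $c = 1/4$ and $a = x_k - 2x_j$, after writing $X_{i+1,k} = x_k + Z$:

```latex
\begin{align*}
E\bigl[W^i_{jk} \mid X_{ij} = x_j,\ X_{ik} = x_k\bigr]
  &= \sqrt{2}\, e^{x_j^2/2}\, E\bigl[e^{-\frac14 (Z + x_k - 2x_j)^2}\bigr] \\
  &= \sqrt{2}\, e^{x_j^2/2}\, \sqrt{\tfrac{2}{3}}\; e^{-(x_k - 2x_j)^2/6} \\
  &= \frac{2}{\sqrt{3}} \exp\Bigl(-\frac{x_k^2 - 4 x_j x_k + x_j^2}{6}\Bigr).
\end{align*}
```

The middle line makes the centering at $2x_j$ visible; the last line collects the exponent, using $x_j^2/2 - 4x_j^2/6 = -x_j^2/6$.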


Fig. 8.11. Comparison of regression-based weight (solid) and conditional 
expectation of likelihood ratio weight (dashed). The curves show the weight 
assigned by node $x_j = 1$ to the successor of node $x_k$ as a function of 
$x_k$. Horizontal axis: node value. 


Example 8.6.3 Option on a single underlying asset. Consider the pricing of 
an American call option on an asset modeled by geometric Brownian motion. 
The option expires in three years and can be exercised at any of 10 equally 
spaced exercise opportunities: $m = 10$ and $t_i = 0.3i$ for 
$i = 1,\ldots,m$. The payoff upon exercise at $t_i$ is $(S(t_i) - K)^+$, with 
$K = 100$ and the underlying asset $S$ described by 
GBM($r - \delta$, $\sigma^2$), with $S(0) = 100$, volatility $\sigma = 0.20$, 
interest rate $r = 5\%$, and dividend yield $\delta = 10\%$. (In the absence 
of dividends, early exercise would be suboptimal and the option would reduce 
to a European call.) This option is simple enough to be priced efficiently in 
a binomial lattice, which yields a price of 7.98. 

Figure 8.12 shows the value of the option at $t_1$ as a function of the level 
$S(t_1)$ of the underlying asset. The solid line shows the value function 
computed using a 2000-step binomial lattice. The other two lines show 
estimates computed from an 800-path mesh generated using the independent-path 
construction. The dashed line shows the value estimated using regression with 
basis functions 1, $x$, $x^2$, $x^3$, and $x^4$ and the dotted line is based 
on the likelihood ratio weights (8.41) with transition densities evaluated as 
in each factor of (8.43). In both cases, the value displayed is for time $t_1$ 
and therefore results from nine applications of the backward induction 
starting from the terminal time $t_m$. Both estimates come quite close to the 
exact value (as computed in the binomial lattice), though the estimates based 
on likelihood ratio weights are somewhat closer. As the number of paths 
increases, we should expect the likelihood ratio values to approach the exact 
values while errors in the regression values persist, because the exact value 
is not given by a fourth-order polynomial. The greater accuracy of the 
likelihood ratio weights suggested by the figure must be balanced against 
their much greater computational burden; see the discussion of computational 
considerations near the end of Section 8.5.2. □ 


Fig. 8.12. Comparison of exact and estimated value functions for a call option on 
a single underlying asset as functions of the price of the underlying asset. 


8.7 Duality 


Throughout this chapter, we have formulated the American option pricing 
problem as one of maximizing over stopping times. Haugh and Kogan [172] 
and Rogers [308] have established dual formulations in which the price is 
represented through a minimization problem. The dual minimizes over a class 
of supermartingales or martingales. These duality results lead to useful 
approximations and upper bounds on prices, in addition to having theoretical 
interest. 

We continue to work with discounted variables, as in (8.6)-(8.7), meaning 
that all payoffs and values are denominated in time-0 dollars. An immediate 
consequence of the dynamic programming recursion (8.7) is that 

\[ V_i(X_i) \ge E[V_{i+1}(X_{i+1}) \mid X_i], \]

for all $i = 0, 1, \ldots, m-1$, and this is the defining property of a 
supermartingale. Also, $V_i(X_i) \ge h_i(X_i)$ for all $i$. The value process 
$V_i(X_i)$, $i = 0, 1, \ldots, m$, is in fact the minimal supermartingale 
dominating $h_i(X_i)$. Haugh and Kogan [172] extend this characterization to 
formulate the pricing of American options as a minimization problem. 

Rogers [308] proves a continuous-time duality result which we specialize to 
the case of $m$ exercise dates. Let $M = \{M_i, i = 0,\ldots,m\}$ be a 
martingale with $M_0 = 0$. By the optional sampling property of martingales, 
for any stopping time $\tau$ taking values in $\{1,\ldots,m\}$ we have 

\[ E[h_\tau(X_\tau)] = E[h_\tau(X_\tau) - M_\tau] \le E\Bigl[\max_{k=1,\ldots,m} \{h_k(X_k) - M_k\}\Bigr], \]

and thus 

\[ E[h_\tau(X_\tau)] \le \inf_M E\Bigl[\max_{k=1,\ldots,m} \{h_k(X_k) - M_k\}\Bigr], \]

the infimum taken over martingales with initial value 0. Because this 
inequality holds for every $\tau$, it also holds for the supremum over $\tau$, 
so 

\[ V_0(X_0) = \sup_\tau E[h_\tau(X_\tau)] \le \inf_M E\Bigl[\max_{k=1,\ldots,m} \{h_k(X_k) - M_k\}\Bigr]. \tag{8.58} \]

The minimization problem on the right is the dual problem. 

What makes (8.58) particularly interesting is that it holds with equality. 
We show this by constructing a martingale for which the expectation on the 
right equals $V_0(X_0)$. To this end, define 

\[ \Delta_i = V_i(X_i) - E[V_i(X_i) \mid X_{i-1}], \quad i = 1,\ldots,m, \tag{8.59} \]

and set 

\[ M_i = \Delta_1 + \cdots + \Delta_i, \quad i = 1,\ldots,m, \tag{8.60} \]

with $M_0 = 0$. That this process is indeed a martingale follows from the 
property 

\[ E[\Delta_i \mid X_{i-1}] = 0 \tag{8.61} \]

of the differences $\Delta_i$. 
We now use induction to show that 

\begin{align*}
V_i(X_i) = \max\{ & h_i(X_i),\ h_{i+1}(X_{i+1}) - \Delta_{i+1},\ \ldots, \\
& h_m(X_m) - \Delta_m - \cdots - \Delta_{i+1} \}, \tag{8.62}
\end{align*}

for all $i = 1,\ldots,m$. This holds at $i = m$ because 
$V_m(X_m) = h_m(X_m)$. Assuming it holds at $i$, then using 

\begin{align*}
V_{i-1}(X_{i-1}) &= \max\{h_{i-1}(X_{i-1}),\ E[V_i(X_i) \mid X_{i-1}]\} \\
&= \max\{h_{i-1}(X_{i-1}),\ V_i(X_i) - \Delta_i\}
\end{align*}

we see that it extends to $i - 1$. 

The option price at time 0 is 

\[ V_0(X_0) = E[V_1(X_1) \mid X_0] = V_1(X_1) - \Delta_1. \]

By rewriting $V_1(X_1)$ using (8.62), we find that 

\[ V_0(X_0) = \max_{k=1,\ldots,m} \bigl(h_k(X_k) - M_k\bigr), \tag{8.63} \]

thus verifying that equality is indeed achieved in (8.58) and that optimality 
is attained by the martingale defined in (8.59)-(8.60). This optimal 
martingale is a special case of one specified more generally by the Doob-Meyer 
decomposition of a supermartingale; see Rogers [308]. 

The martingale differences (8.59) can alternatively be written as

$$\Delta_i = V_i(X_i) - C_{i-1}(X_{i-1}) \tag{8.64}$$

with $C_{i-1}$ denoting the continuation value, as in (8.12). Because $C_0(X_0) = V_0(X_0)$,
finding the optimal martingale appears to be as difficult as solving
the original optimal stopping problem. But if we can find a martingale $\hat M$
that is close to the optimal martingale, then we can use

$$\max_{k=1,\ldots,m} \big(h_k(X_k) - \hat M_k\big) \tag{8.65}$$

to estimate (an upper bound for) the option price.
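
The pathwise identity (8.63) can be checked numerically on a small example. The following is a minimal sketch, not taken from the text: it assumes a Bermudan put on a recombining binomial tree (the function name `dual_check` and all parameter values are hypothetical), computes $V_i$ and $C_i$ by backward induction, builds the optimal martingale from (8.64), and evaluates $\max_k (h_k(X_k) - M_k)$ on every one of the $2^m$ paths. For comparison it also returns the bound obtained from the trivial martingale $M \equiv 0$.

```python
import itertools
import math


def dual_check(s0=100.0, strike=100.0, r=0.05, sigma=0.3, dt=0.25, m=6):
    """Check (8.63) path by path on a binomial tree (illustrative setup)."""
    u = math.exp(sigma * math.sqrt(dt))
    d = 1.0 / u
    p = (math.exp(r * dt) - d) / (u - d)  # risk-neutral up probability

    def h(i, j):
        # Put payoff at date i after j up-moves, discounted to time 0.
        s = s0 * u**j * d ** (i - j)
        return math.exp(-r * i * dt) * max(strike - s, 0.0)

    # Backward induction for the values V_i and continuation values C_i.
    V = [[0.0] * (i + 1) for i in range(m + 1)]
    C = [[0.0] * (i + 1) for i in range(m)]
    V[m] = [h(m, j) for j in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(i + 1):
            C[i][j] = p * V[i + 1][j + 1] + (1.0 - p) * V[i + 1][j]
            # Exercise dates are 1..m, so V_0 equals the continuation value.
            V[i][j] = max(h(i, j), C[i][j]) if i >= 1 else C[i][j]

    pathwise_max = []
    zero_martingale_bound = 0.0
    for moves in itertools.product((0, 1), repeat=m):
        j, mart, prob = 0, 0.0, 1.0
        best, best_h = -math.inf, -math.inf
        for i, up in enumerate(moves, start=1):
            cont_prev = C[i - 1][j]          # E[V_i(X_i) | X_{i-1}]
            j += up
            prob *= p if up else 1.0 - p
            mart += V[i][j] - cont_prev      # Delta_i as in (8.64)
            best = max(best, h(i, j) - mart)
            best_h = max(best_h, h(i, j))
        pathwise_max.append(best)
        zero_martingale_bound += prob * best_h   # E[max_k h_k] with M = 0
    return V[0][0], pathwise_max, zero_martingale_bound


v0, path_vals, zero_bound = dual_check()
print(v0, min(path_vals), max(path_vals), zero_bound)
```

With the optimal martingale every path should return the same value $V_0(X_0)$ up to rounding, whereas the bound from $M \equiv 0$ is valid but loose.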

Where can we find nearly optimal martingales? A general strategy is to 
construct a martingale from an approximate value function or stopping rule. 
This idea is developed and shown to be effective in Andersen and Broadie 
[15] and Haugh and Kogan [172]. A suboptimal exercise policy provides a 
lower bound on the price of an American option; the dual value defined by 
extracting a martingale from the suboptimal policy complements the lower 
bound with an upper bound. 

There are various characterizations of the martingale associated with the 
optimal value function and these suggest alternative strategies for constructing 
approximating martingales. We discuss two such approaches based on (8.64).


Martingales from Approximate Value Functions


Let $\hat C_i$ denote an approximation to the continuation value $C_i$, $i = 0, 1, \ldots, m-1$.
For example, $\hat C_i$ could result from a parametric approximation, as in Section 8.2,
or through regression, as in (8.47). The associated approximate value
function is $\hat V_i(x) = \max\{h_i(x), \hat C_i(x)\}$.

Along a simulated path $X_0, X_1, \ldots, X_m$, one could evaluate the differences

$$\hat\Delta_i = \hat V_i(X_i) - \hat C_{i-1}(X_{i-1})$$

approximating (8.64). However, these are not in general martingale differences
because they may fail to satisfy (8.61); in particular, there is no guarantee that
the approximate value function satisfies

$$\hat C_{i-1}(X_{i-1}) = E[\hat V_i(X_i) \mid X_{i-1}].$$

If the $\hat\Delta_i$ do not satisfy (8.61), then using their sum $\hat M_k$ in (8.65) will not
produce a valid upper bound.

To extract martingale differences from an approximate value function, we
need to work a bit harder and compute (an estimate of)

$$\hat\Delta_i = \hat V_i(X_i) - E[\hat V_i(X_i) \mid X_{i-1}]. \tag{8.66}$$

The first term on the right merely requires evaluating the approximation $\hat V_i$
along a simulated path of the Markov chain. For the second term, we can use
a nested simulation. At each step $X_{i-1}$ of the Markov chain, we generate $n$
successors $X_i^{(1)}, \ldots, X_i^{(n)}$ and use

$$\frac{1}{n} \sum_{j=1}^n \hat V_i(X_i^{(j)}) \tag{8.67}$$

to estimate the conditional expectation of $\hat V_i(X_i)$ given $X_{i-1}$. We discard
these $n$ successors and generate a new one to get the next step $X_i$ of the path
of the Markov chain. Thus, each nested simulation evolves over just a single
step of the Markov chain. (As discussed in Glasserman and Yu [148], nested
simulations become unnecessary if one selects basis functions for which the
required conditional expectations can be evaluated in closed form.)

The simulated values

$$\hat\Delta_i = \hat V_i(X_i) - \frac{1}{n} \sum_{j=1}^n \hat V_i(X_i^{(j)})$$

are martingale differences, even though the second term on the right is only an
estimate of the conditional expectation. This term is conditionally unbiased
(given $X_{i-1}$), so the conditional expectation of $\hat\Delta_i$ given $X_{i-1}$ is zero, as
required by (8.61). It follows that using these $\hat\Delta_i$ in (8.65) is guaranteed to
produce a high-biased estimator, even for finite $n$.
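
The nested-simulation estimator built from (8.66)-(8.67) can be sketched as follows. This is an illustration under assumptions that are not in the text: a single-asset Bermudan put under geometric Brownian motion, with the approximate continuation value $\hat C_i$ taken (hypothetically) to be the discounted Black-Scholes European put value and $\hat V_m = h_m$ at the last date. The names `dual_v_estimate`, `bs_put`, and `norm_cdf` and all parameter values are made up for the example.

```python
import math

import numpy as np

_erf = np.vectorize(math.erf)


def norm_cdf(z):
    """Standard normal CDF applied elementwise."""
    return 0.5 * (1.0 + _erf(np.asarray(z, dtype=float) / math.sqrt(2.0)))


def bs_put(s, strike, r, sigma, tau):
    """Black-Scholes European put value for time to maturity tau > 0."""
    d1 = (np.log(s / strike) + (r + 0.5 * sigma**2) * tau) / (sigma * math.sqrt(tau))
    d2 = d1 - sigma * math.sqrt(tau)
    return strike * math.exp(-r * tau) * norm_cdf(-d2) - s * norm_cdf(-d1)


def dual_v_estimate(s0=100.0, strike=100.0, r=0.05, sigma=0.2, T=1.0, m=4,
                    n_paths=2000, n_sub=50, seed=0):
    """High-biased dual estimate from an approximate value function."""
    rng = np.random.default_rng(seed)
    dt = T / m

    def step(s, z):
        # One exact GBM transition under the risk-neutral measure.
        return s * np.exp((r - 0.5 * sigma**2) * dt + sigma * math.sqrt(dt) * z)

    def h(i, s):
        # Put payoff at date i, discounted to time 0.
        return math.exp(-r * i * dt) * np.maximum(strike - s, 0.0)

    def v_hat(i, s):
        # Assumed approximation: max of payoff and discounted European value.
        if i == m:
            return h(i, s)
        c_hat = math.exp(-r * i * dt) * bs_put(s, strike, r, sigma, T - i * dt)
        return np.maximum(h(i, s), c_hat)

    x = np.full(n_paths, s0)
    mart = np.zeros(n_paths)
    best = np.full(n_paths, -np.inf)
    for i in range(1, m + 1):
        # n_sub one-step successors of X_{i-1}, used only for (8.67).
        subs = step(x[:, None], rng.standard_normal((n_paths, n_sub)))
        cond_mean = v_hat(i, subs).mean(axis=1)
        # Discard them and draw a fresh successor for the path itself.
        x = step(x, rng.standard_normal(n_paths))
        mart += v_hat(i, x) - cond_mean          # estimated Delta_i
        best = np.maximum(best, h(i, x) - mart)  # running max in (8.65)
    return best.mean(), best.std(ddof=1) / math.sqrt(n_paths)


estimate, std_err = dual_v_estimate()
print(estimate, std_err)
```

Because the subpath average is conditionally unbiased, the returned mean is biased high for any `n_sub`; increasing `n_sub` should tighten the bound at proportionally higher cost.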




Haugh and Kogan [172] propose and test other methods for constructing
upper bounds from approximate value functions. Their formulation of the dual
problem takes an infimum over supermartingales rather than martingales, and
they therefore use $\hat V_i$ to construct supermartingales.


Martingales from Stopping Rules 


Consider an exercise policy defined by stopping times $\tau_1, \ldots, \tau_m$ with $\tau_i$
interpreted as the exercise time for an option issued at the $i$th exercise date; in
particular, $\tau_i \ge i$. Suppose these stopping times are defined by approximate
continuation values $\hat C_i$, $i = 1, \ldots, m$, with $\hat C_m \equiv 0$, through

$$\tau_i = \min\{k = i, \ldots, m : h_k(X_k) \ge \hat C_k(X_k)\}, \tag{8.68}$$

for $i = 1, \ldots, m$, much as in (8.13). For example, $\hat C_i$ might be specified through
the regression representation (8.50).
If these stopping times defined an optimal exercise policy, then the differences

$$\Delta_i = E[h_{\tau_i}(X_{\tau_i}) \mid X_i] - E[h_{\tau_i}(X_{\tau_i}) \mid X_{i-1}] \tag{8.69}$$

would define an optimal martingale: this difference is just another way of writing
(8.64). Even under a suboptimal policy, these $\Delta_i$ are martingale differences
and thus lead to a valid upper bound. This therefore provides a general mechanism
for defining martingales from stopping rules.

Evaluating these martingale differences requires computing (or estimating)
conditional expectations and for these we again use nested simulations. The
expression in (8.69) involves two conditional expectations which may appear
to require separate treatment. Rewriting the first term on the right side of
(8.69) as

$$E[h_{\tau_i}(X_{\tau_i}) \mid X_i] = \begin{cases} h_i(X_i), & \text{if } h_i(X_i) \ge \hat C_i(X_i); \\ E[h_{\tau_{i+1}}(X_{\tau_{i+1}}) \mid X_i], & \text{if } h_i(X_i) < \hat C_i(X_i), \end{cases} \tag{8.70}$$

reveals that the only conditional expectations we need to estimate are

$$E[h_{\tau_{i+1}}(X_{\tau_{i+1}}) \mid X_i], \quad i = 0, 1, \ldots, m-1.$$

These can then be used in (8.70) and (8.69) to compute the $\Delta_i$.
The method proceeds by replicating the following steps:


o Simulate a path $X_0, X_1, \ldots, X_m$ of the underlying Markov chain.

o At each $X_i$, $i = 0, 1, \ldots, m-1$,
  - evaluate $h_i(X_i)$ and $\hat C_i(X_i)$ and check which is larger, as in (8.70), taking
    $h_0 \equiv 0$;
  - simulate $n$ subpaths starting from $X_i$ and following the exercise policy
    $\tau_{i+1}$; record the payoff $h_{\tau_{i+1}}(X_{\tau_{i+1}})$ from each subpath and use the
    average as an estimate of $E[h_{\tau_{i+1}}(X_{\tau_{i+1}}) \mid X_i]$.

o Combine the estimates of the conditional expectations as in (8.69) and
  (8.70) to estimate the differences $\Delta_i$.

o Sum these differences to get $M_k = \Delta_1 + \cdots + \Delta_k$, $k = 1, \ldots, m$.

o Evaluate the maximum of $h_k(X_k) - M_k$ over $k = 1, \ldots, m$ as in (8.65).


Observe that whereas each subpath used in (8.67) evolves for exactly one step 
of the Markov chain, here the subpaths evolve for a random number of steps 
determined by the stopping rule. 
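
The steps listed above can be sketched in code. The example below is only an illustration under assumptions not made in the text: a Bermudan put under geometric Brownian motion, and a hypothetical threshold rule (exercise before the last date when the asset price is at or below `boundary`) standing in for the comparison $h_k(X_k) \ge \hat C_k(X_k)$ in (8.68). The function name and all parameter values are invented. The same policy also yields a low-biased estimate, which the function returns alongside the dual value.

```python
import math

import numpy as np


def dual_tau_estimate(s0=100.0, strike=100.0, r=0.05, sigma=0.2, T=1.0, m=4,
                      boundary=85.0, n_paths=1000, n_sub=50, seed=0):
    """Lower and upper estimates from one stopping rule (illustrative)."""
    rng = np.random.default_rng(seed)
    dt = T / m

    def step(s):
        z = rng.standard_normal(s.shape)
        return s * np.exp((r - 0.5 * sigma**2) * dt + sigma * math.sqrt(dt) * z)

    def h(k, s):
        # Put payoff at date k, discounted to time 0.
        return math.exp(-r * k * dt) * np.maximum(strike - s, 0.0)

    def exercise(k, s):
        # Stand-in for h_k >= C_hat_k; always stop at the last date.
        return np.ones(s.shape, dtype=bool) if k == m else (s <= boundary)

    def policy_payoff(i, s):
        # Payoff under tau_{i+1} for subpaths started from state s at date i.
        s = s.copy()
        payoff = np.zeros(s.shape)
        alive = np.ones(s.shape, dtype=bool)
        for k in range(i + 1, m + 1):
            s = step(s)
            stop = alive & exercise(k, s)
            payoff[stop] = h(k, s)[stop]
            alive &= ~stop
        return payoff

    def nested_mean(i, x):
        # Average over n_sub subpaths: estimate of E[h_{tau_{i+1}} | X_i].
        return policy_payoff(i, np.repeat(x[:, None], n_sub, axis=1)).mean(axis=1)

    # Low-biased estimate: follow tau_1 on independent paths.
    lower = policy_payoff(0, np.full(n_paths, s0)).mean()

    x = np.full(n_paths, s0)
    cont_prev = nested_mean(0, x)
    mart = np.zeros(n_paths)
    best = np.full(n_paths, -np.inf)
    for i in range(1, m + 1):
        x = step(x)
        if i < m:
            cont = nested_mean(i, x)
            value = np.where(exercise(i, x), h(i, x), cont)   # as in (8.70)
        else:
            cont, value = None, h(m, x)
        mart += value - cont_prev                  # Delta_i as in (8.69)
        best = np.maximum(best, h(i, x) - mart)    # running max in (8.65)
        cont_prev = cont
    return lower, best.mean()


low, high = dual_tau_estimate()
print(low, high)
```

Each subpath here runs until the rule stops it, so the cost per outer path is higher than with single-step subpaths.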

This is the method of Andersen and Broadie [15], but formulated in terms 
of the differences (8.69). They apply their method in combination with stop- 
ping rules defined as in Longstaff and Schwartz [241] and, for Bermudan swap- 
tions, Andersen [12]. Their numerical results indicate excellent performance 
with reasonable computational effort. As they point out, in addition to provid- 
ing a confidence interval, the spread between the estimated lower and upper 
bounds gives an indication of whether additional computational effort should 
be dedicated to improving the exercise policy — for example, by increasing 
the number of basis functions. 


Numerical Example 


Table 8.3 displays numerical results obtained by applying these dual estimates 
to the American max option of Example 8.6.1. The labels “Dual-V” and 
“Dual-τ” refer, respectively, to dual estimates based on approximate value
functions (as in (8.66)) and dual estimates based on stopping rules (as in 
(8.69)). Each of these methods requires simulating subpaths; the table shows 
results for n = 10 and n = 100 subpaths per replication in the last two pairs 
of columns. The regression and low estimates in the first pair of columns are 
repeated from Tables 8.1 and 8.2 for comparison. The dual results in Table 8.3 
are based on 4000 initial paths used to estimate regression coefficients followed 
by 100 independent paths along which the duals are computed using n = 10 
or n = 100 subpaths at each step; the entire procedure is then replicated 
100 times to allow estimation of standard errors. We consider initial values 
S(0) = 100, 110, and 90, for which the correct prices are 13.90, 21.34, and 8.08. 
The first three rows in Table 8.3 use the first set of basis functions displayed 
in Table 8.1 (a poor set), and the second three rows use the last set of basis 
functions from Table 8.1 (a much better set). Standard errors for all estimates 
in the table range between 0.02 and 0.03. 

The dual estimates are markedly better (giving tighter upper bounds) with 
n = 100 subpaths than n = 10. Recall that the subpaths are used to estimate 
the conditional expectations defining the $\Delta_i$, so increasing $n$ gives better
estimates of these conditional expectations; the results in the table show the
importance of estimating these accurately. The results also indicate that the
two dual estimates give similar values in this example. When both methods
use the same number of paths and subpaths, and when these sample sizes are
large, we expect that Dual-τ will usually give better estimates than Dual-V
because it is based directly on an implementable exercise rule rather than an
approximate value function. However, Dual-V has the advantage of using only 
single-step subpaths and thus requires less computing time per path. A defin- 
itive comparison of the two methods would evaluate each under an optimal 
implementation, and this would require finding the optimal combination of 
sample sizes for the initial set of paths used for regression, the second-phase 
paths, and the subpaths. 


                                  n = 10             n = 100
S(0)   Regression    Low     Dual-V   Dual-τ     Dual-V   Dual-τ

100      15.74      13.62    15.86    15.96      14.58    14.26
110      24.52      20.79    24.09    24.56      22.38    21.94
 90       9.49       7.93     9.43     9.21       8.59     8.24

100      14.08      13.78    15.46    15.82      14.16    14.16
110      21.38      21.26    23.55    24.21      21.68    21.72
 90       8.27       7.99     9.07     9.16       8.25     8.20

Table 8.3. Estimates for the American max option of Example 8.6.1. Correct values
for S(0) = 100, 110, and 90 are 13.90, 21.34, and 8.08. Standard errors for all
estimates in the table are approximately 0.02-0.03.


The dual estimates with n = 100 are noticeably better (lower) than the 
regression estimates in the first three rows, but roughly the same as the re- 
gression estimates in the last three rows. The dual estimates are thus most 
valuable when the chosen basis functions do not provide a good fit to the 
option value. This pattern is explained in part by a connection between the 
regression estimates (and any other estimate based on approximate dynamic 
programming) and the dual estimates, to which we now turn. 


Connection with Regression 


If the regression relation (8.47) holds (meaning that the optimal continuation
value is linear in the basis functions), then the regression residuals are
the optimal martingale differences. To see why, observe from the definition
(8.59) of the $\Delta_i$ that

$$V_{i+1}(X_{i+1}) = E[V_{i+1}(X_{i+1}) \mid X_i] + \Delta_{i+1},$$

and that the equivalent properties (8.46) and (8.47) then imply

$$V_{i+1}(X_{i+1}) = \beta_i^\top \psi(X_i) + \Delta_{i+1}.$$

This shows that $\Delta_{i+1}$ is the residual in the regression of $V_{i+1}(X_{i+1})$ against
$\psi(X_i)$. If we knew with certainty that the regression relation (8.47) held, then
this would provide a simple mechanism for finding the $\Delta_i$ as a by-product of
estimating the coefficients $\beta_i$. However, in the more typical case that (8.47)
holds only approximately, the regression residuals will not satisfy the martingale
difference condition (8.59) and are therefore not guaranteed to provide a
valid upper bound on the option price.

Even if the regression equation (8.47) does not hold exactly, this link 
sheds light on the relation between the regression-based estimator calculated 
through the dynamic programming recursion and the dual estimators. We end 
this section by developing this idea. 

Given coefficient vectors $\beta_i$, $i = 1, \ldots, m-1$, estimated or exact, consider
the sequence of value estimates

$$\hat V_i(X_i) = \max\{h_i(X_i),\, \beta_i^\top \psi(X_i)\}, \quad i = 1, \ldots, m-1, \tag{8.72}$$

applied to a path $X_1, \ldots, X_m$ of the underlying Markov chain. Set
$\hat V_0(X_0) = E[\hat V_1(X_1)]$. The limit as $b \to \infty$ of the regression-based estimator defined
through (8.51) has this form; this follows from Theorem 2 of Tsitsiklis and
Van Roy [351].

Define residuals $\hat\epsilon_{i+1}$, $i = 1, \ldots, m-1$, through the equation

$$\hat V_{i+1}(X_{i+1}) = \beta_i^\top \psi(X_i) + \hat\epsilon_{i+1},$$

and

$$\hat V_1(X_1) = \hat V_0(X_0) + \hat\epsilon_1.$$

Using exactly the same algebraic steps leading to (8.62) and (8.63), we find
that

$$\hat V_0(X_0) = \max_{k=1,\ldots,m} \Big(h_k(X_k) - \sum_{i=1}^k \hat\epsilon_i\Big). \tag{8.73}$$

Thus, the approximation $\hat V_0(X_0)$ defined through regression and dynamic
programming admits a representation analogous to the dual formulation of the
true value $V_0(X_0)$ in (8.63), except that the cumulative sum of the $\hat\epsilon_i$ is not
in general a martingale.

This explains the pattern of numerical results in Table 8.3. With a good
choice of basis functions, $\beta_i^\top \psi(X_i)$ is nearly equal to the continuation value
at $X_i$ so the regression residual $\hat\epsilon_{i+1}$ is nearly equal to the optimal martingale
difference $\Delta_{i+1}$, and the regression approximation $\hat V_0(X_0)$ is nearly the same as
the dual value. With a poor choice of basis functions, the residuals are farther
from the optimal $\Delta_i$ and we see a greater difference between the regression
estimate and the dual value.

These observations apply more generally to any approximation computed
through backward induction (including a stochastic mesh with likelihood ratio
weights), even without the use of regression. Suppose (8.71)-(8.72) hold with
$\beta_i^\top \psi(X_i)$ replaced with an arbitrary approximation $\hat C_i(X_i)$ to the continuation
value at step $i$. Define residuals through

$$\hat V_{i+1}(X_{i+1}) = \hat C_i(X_i) + \hat\epsilon_{i+1},$$

and then (8.73) holds. This shows that the high-biased estimates calculated
through backward induction in Sections 8.3 and 8.5-8.6 have nearly the same
form as high-biased estimates based on duality.


8.8 Concluding Remarks 


The pricing of American options through Monte Carlo simulation is an active 
and evolving area of research. We have limited this chapter to techniques that 
address the problem generally, and not discussed the many ad hoc methods 
that have been developed for specific models. Some of the history of the topic 
and some early proposals are surveyed in Boyle et al. [53]. 

Because the field is still in flux, comparisons and conclusions are potentially
premature. Based on the current state of knowledge and an admittedly
subjective view of the field, we offer the following summary of the methods
discussed in this chapter.


o Parametric approximations provide a relatively simple way of computing 
rough estimates, particularly in problems for which good information is 
available about the main drivers of early exercise. A sound implementation 
of this approach uses a second pass (Step 3 in Section 8.2) to compute a low- 
biased estimator and also an estimate of the dual value. The gap between 
the two estimates reflects the quality of the approximation. 

o The random tree method is very simple to implement and relies on no more 
than the ability to generate paths of the underlying processes. It provides a 
conservative confidence interval that shrinks to the true value under minor 
regularity conditions. Because its complexity is exponential in the number of 
exercise dates, it is suitable only for options with few exercise opportunities 
— not more than about five. 

o State-space partitioning is relatively insensitive to the number of exercise 
dates but its complexity is exponential in the dimension of the state space, 
making it inapplicable to high-dimensional problems. 

o The stochastic mesh and regression-based methods provide the most powerful
techniques for solving high-dimensional problems with many exercise
opportunities. Mesh weights based on likelihood ratios are theoretically the
most general but have two shortcomings: they require evaluating transition
densities for the underlying processes (which may be unknown or may fail
to exist), and their computational complexity is $O(b^2)$, with $b$ the number
of paths. Regression-based methods correspond to an implicit choice
of mesh weights. They reduce the $O(b^2)$ complexity of general weights to
$O(Mb)$, with $M$ the number of basis functions. This makes it practical to
simulate many more paths and thus achieve lower variability, though not
necessarily greater accuracy. Accuracy depends on the choice of basis functions,
which may require experimentation or good information about the
structure of the problem. A sound implementation of either the stochastic
mesh or regression-based estimator uses a second pass to compute a low-biased
estimator. This can be paired with an estimate of the dual value to
produce an interval for the true price. The first-pass estimator based on
dynamic programming is often biased high; this is essentially always true
using likelihood ratio weights and typically true using regression. In the case
of regression, the gap between high and low estimates provides a measure
of the quality of fit achieved with the chosen basis functions and can thus
alert the user to a need to change the set of basis functions.




9 


Applications in Risk Management 


This chapter discusses applications of Monte Carlo simulation to risk manage- 
ment. It addresses the problem of measuring the risk in a portfolio of assets, 
rather than computing the prices of individual securities. Simulation is use- 
ful in estimating the profit and loss distribution of a portfolio and thus in 
computing risk measures that summarize this distribution. We give particular 
attention to the problem of estimating the probability of large losses, which 
entails simulation of rare but significant events. We separate the problems of 
measuring market risk and credit risk because different types of models are 
used in the two domains. 

There is less consensus in risk management around choices of models and 
computational methods than there is in derivatives pricing. And while sim- 
ulation is widely used in the practice of risk management, research on ways 
of improving this application of simulation remains limited. This chapter em- 
phasizes a small number of specific techniques for specific problems in the 
broad area of risk management. 


9.1 Loss Probabilities and Value-at-Risk 


9.1.1 Background 


A prerequisite to managing market risk is measuring market risk, especially 
the risk of large losses. For the large and complex portfolios of assets held 
by large financial institutions, this presents a significant challenge. Some of 
the obstacles to risk measurement are administrative — creating an accurate, 
centralized database of a firm’s positions spanning multiple markets and asset 
classes, for example — others are statistical and computational. Any method 
for measuring market risk must address two questions in particular: 


o What statistical model accurately yet conveniently describes the movements
in the individual sources of risk and co-movements of multiple
sources of risk affecting a portfolio?

o How does the value of a portfolio change in response to changes in the
underlying sources of risk?


The first of these questions asks for the joint distribution of changes in risk 
factors — the exchange rates, interest rates, equity, and commodity prices to 
which a portfolio may be exposed. The second asks for a mapping from risk 
factors to portfolio value. Once both elements are specified, the distribution of 
portfolio profit and loss is in principle determined, as is then any risk measure 
that summarizes this distribution. 

Addressing these two questions inevitably involves balancing the complex- 
ity required by the first with the tractability required by the second. The 
multivariate normal, for example, has known deficiencies as a model of mar- 
ket prices but is widely used because of its many convenient properties. Our 
focus is more on the computational issues raised by the second question than 
the statistical issues raised by the first. It is nevertheless appropriate to men- 
tion two of the most salient features of the distribution of changes in market 
prices and rates: they are typically heavy-tailed, and their co-movements are 
at best imperfectly described by their correlations. The literature document- 
ing evidence of heavy tails is too extensive to summarize — an early refer- 
ence is Mandelbrot [246]; Campbell, Lo, and MacKinlay [74] and Embrechts, 
Klüppelberg, and Mikosch [111] provide more recent accounts. Shortcomings 
of correlation and merits of alternative measures of dependence in financial 
data are discussed by, among others, Embrechts, McNeil, and Straumann 
[112], Longin and Solnik [240], and Mashal and Zeevi [255]. We revisit these 
issues in Section 9.3, but mostly work with simpler models. 

To describe in more detail the problems we consider, we introduce some
notation:


$S$ = vector of $m$ market prices and rates;
$\Delta t$ = risk-measurement horizon;
$\Delta S$ = change in $S$ over interval $\Delta t$;
$V(S, t)$ = portfolio value at time $t$ and market prices $S$;
$L$ = loss over interval $\Delta t$
    = $-\Delta V = V(S, t) - V(S + \Delta S, t + \Delta t)$;
$F_L(x)$ = $P(L \le x)$, the distribution of $L$.


The number $m$ of relevant risk factors could be very large, potentially reaching
the hundreds or thousands. In bank supervision the interval $\Delta t$ is usually
quite short, with regulatory agencies requiring measurement over a two-week
horizon, and this is the setting we have in mind. The two-week horizon is often
interpreted as the time that might be required to unwind complex positions
in the case of an adverse market move. In other areas of market risk, such
as asset-liability management for pension funds and insurance companies, the
relevant time horizon is far longer and requires a richer framework.

The notation above reflects some implicit simplifying assumptions. We consider
only the net loss over the horizon $\Delta t$, ignoring for example the maximum
and minimum portfolio value within the horizon. We ignore the dynamics of
the market prices, subsuming all details about the evolution of $S$ in the vector
of changes $\Delta S$. And we assume that the composition of the portfolio remains
fixed, though the value of its components may change in response to the market
movement $\Delta S$ and the passage of time $\Delta t$, which may bring assets closer
to maturity or expiry.

The portfolio’s value-at-risk (VAR) is a percentile of its loss distribution
over a fixed horizon $\Delta t$. For example, the 99% VAR is a point $x_p$ satisfying

$$1 - F_L(x_p) = P(L > x_p) = p$$

with $p = 0.01$. (For simplicity, we assume throughout that $F_L$ is continuous so
that such a point exists; ties can be broken using (2.14).) A quantile provides
a simple way of summarizing information about the tail of a distribution, and 
this particular value is often interpreted as a reasonable worst-case loss level. 
VAR gained widespread acceptance as a measure of risk in the late 1990s, in 
large part because of international initiatives in bank supervision; see Jorion 
[203] for an account of this history. VAR might more accurately be called a 
measure of capital adequacy than simply a measure of risk. It is used primarily 
to determine if a bank has sufficient capital to sustain losses from its trading 
activities. 

The widespread adoption of VAR has been accompanied by frequent crit- 
icism of VAR as a measure of risk or capital adequacy. Any attempt to sum- 
marize a distribution in a single number is open to criticism, but VAR has a 
particular deficiency stressed by Artzner, Delbaen, Eber, and Heath [19]: com- 
bining two portfolios into a single portfolio may result in a VAR that is larger 
than the sum of the VARs for the two original portfolios. This runs counter 
to the idea that diversification reduces risk. Many related measures are free 
of this shortcoming, including the conditional excess E[L|L > x], calling into 
question the appropriateness of VAR. 
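
A small numerical illustration of this failure of subadditivity, using a made-up example that is not from the text: two independent positions each lose 100 with probability 0.008 and nothing otherwise. Because the losses are discrete, the sketch assumes the quantile convention $\inf\{x : P(L > x) \le p\}$; the helper name `discrete_var` is hypothetical.

```python
import numpy as np


def discrete_var(losses, probs, p=0.01):
    """Smallest atom x with P(L > x) <= p for a discrete loss distribution."""
    losses = np.asarray(losses, dtype=float)
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(losses)
    losses, probs = losses[order], probs[order]
    tail = 1.0 - np.cumsum(probs)             # P(L > x) at each atom
    return losses[np.argmax(tail <= p + 1e-12)]


q = 0.008                                      # assumed default probability
var_single = discrete_var([0.0, 100.0], [1 - q, q])
var_combined = discrete_var(
    [0.0, 100.0, 200.0],
    [(1 - q) ** 2, 2 * q * (1 - q), q**2],     # two independent positions
)
print(var_single, var_combined)                # 0.0 100.0
```

Each position alone has a 99% VAR of 0 because its loss probability is below 1%, but the combined portfolio loses at least 100 with probability about 1.6%, so its 99% VAR is 100.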

The significance of VAR (and related measures) lies in its focus on the tail 
of the loss distribution. It emphasizes a probabilistic view of risk, in contrast 
to the more formulaic accounting perspective traditionally used to gauge cap- 
ital adequacy. And through this probabilistic view, it calls attention to the 
importance of co-movements of market risk factors in a portfolio-based ap- 
proach to risk, in contrast to an earlier “building-block” approach that ignores 
correlation. (See, for example, Section 4.2 of Crouhy, Galai, and Mark [93].) 
We therefore focus on the more fundamental issue of measuring the tail of 
the loss distribution, particularly at large losses — i.e., on finding P(L > x) 
for large thresholds x. Once these loss probabilities are determined, it is a 
comparatively simple matter to summarize them using VAR or some other 
measure. 

The relevant loss distribution in risk management is the distribution under
the objective probability measure describing observed events rather than
the risk-neutral or other martingale measure used as a pricing device. Historical
data is thus directly relevant in modeling the distribution of $\Delta S$. One
can imagine a nested simulation (alluded to in Example 1.1.3) in which one
first generates price-change scenarios $\Delta S$, and then in each scenario simulates
paths of underlying assets to revalue the derivative securities in a portfolio. In
such a procedure, the first step (sampling $\Delta S$) takes place under the objective
probability measure and the second step (sampling paths of underlying assets)
ordinarily takes place under the risk-neutral or other risk-adjusted probability
measure. There is no logical or theoretical inconsistency in this combined use
of the two measures. It is useful to keep the roles of the different probability
measures in mind, but we do not stress the distinction in this chapter. Over
a short interval $\Delta t$, it would be difficult to distinguish the real-world and
risk-neutral distributions of $\Delta S$.


9.1.2 Calculating VAR 


There are several approaches to calculating or approximating loss probabilities 
and VAR, each representing some compromise between realism and tractabil- 
ity. How best to make this compromise depends in part on the complexity of 
the portfolio and on the accuracy required. We discuss some of the principal 
methods because they are relevant to our treatment of variance reduction in 
Section 9.2 and because they are of independent interest. 


Normal Market, Linear Portfolio 


By far the simplest approach to VAR assumes that $\Delta S$ has a multivariate
normal distribution and that the change in value $\Delta V$ (hence also the loss $L$)
is linear in $\Delta S$. This gives $L$ a normal distribution and reduces the problem
of calculating loss probabilities and VAR to the comparatively simple task of
computing the mean and standard deviation of $L$.

It is customary to assume that $\Delta S$ has mean zero because over a short
horizon the mean of each component $\Delta S_i$ is negligible compared to its standard
deviation, and because mean returns are extremely difficult to estimate
from historical data. Suppose then that $\Delta S$ has distribution $N(0, \Sigma_S)$ for
some covariance matrix $\Sigma_S$. Estimation of this covariance matrix is itself a
significant challenge; see, for example, the discussion in Alexander [10].

Further suppose that

$$\Delta V = \delta^\top \Delta S, \tag{9.1}$$

for some vector of sensitivities $\delta$. Then $L \sim N(0, \sigma_L^2)$ with $\sigma_L^2 = \delta^\top \Sigma_S \delta$, and
the 99% VAR is $2.33\sigma_L$, because $\Phi(2.33) = 0.99$.
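
As a small worked example of the linear-normal calculation, the sketch below evaluates $2.33\sigma_L$ for a hypothetical two-factor portfolio and compares it with an empirical 99% quantile of simulated losses. The sensitivities, the covariance matrix, and the function names are illustrative assumptions, not values from the text.

```python
import numpy as np


def linear_normal_var(delta, cov, z=2.33):
    """VAR under (9.1) with normal factor changes: z * sqrt(delta' Sigma delta)."""
    delta = np.asarray(delta, dtype=float)
    cov = np.asarray(cov, dtype=float)
    return z * np.sqrt(delta @ cov @ delta)


def simulated_var(delta, cov, p=0.01, n=400_000, seed=0):
    """Empirical (1 - p) quantile of L = -delta' dS with dS ~ N(0, Sigma)."""
    rng = np.random.default_rng(seed)
    ds = rng.multivariate_normal(np.zeros(len(delta)), cov, size=n)
    losses = -ds @ np.asarray(delta, dtype=float)
    return np.quantile(losses, 1.0 - p)


delta = [10.0, -5.0]                     # assumed sensitivities
cov = [[4.0, 3.0], [3.0, 9.0]]           # assumed covariance of the factor changes
analytic = linear_normal_var(delta, cov)
empirical = simulated_var(delta, cov)
print(analytic, empirical)
```

Here $\delta^\top \Sigma_S \delta = 325$, so the analytic value is $2.33\sqrt{325} \approx 42.0$; the simulated quantile should agree up to sampling error and the rounding of the normal quantile to 2.33.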

One might object to the normal distribution as a model of market movements
because it can theoretically produce negative prices and because it is
inconsistent with, for example, a lognormal specification of price levels. But
all we need to assume is that the change $\Delta S$ over the interval $(t, t + \Delta t)$ is
conditionally normal given the price history up to time $t$. Given $S_i$, assuming
that $\Delta S_i$ is normal is equivalent to assuming that the return $\Delta S_i / S_i$ is
normal. For small $\Delta t$,

$$S_i (1 + \Delta S_i / S_i) \approx S_i \exp(\Delta S_i / S_i),$$

so the distinction between normal and lognormal turns out to be relatively
minor in this setting.

It should also be noted that assuming that $\Delta S$ is conditionally normal
imposes a much weaker condition than assuming that changes over disjoint
intervals of length $\Delta t$ are i.i.d. normal. In our calculation of the distribution of
$\Delta V$ based on (9.1), $\Sigma_S$ is the conditional covariance matrix for changes from
$t$ to $t + \Delta t$, given the price history to time $t$. At different times $t$, one would
ordinarily estimate different covariance matrices. The unconditional distribution
of the changes $\Delta S$ would then be a mixture of normals and could even be
heavy-tailed. This occurs, for example, in GARCH models (see Section 8.4 of
Embrechts et al. [111]). Similar ideas are implicit in the discretization methods
of Chapter 6: the increments of the Euler scheme (6.2) are conditionally
normal at each step, but the distribution of the state can be far from normal
after multiple steps.
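
To illustrate how conditionally normal changes can be unconditionally heavy-tailed, here is a minimal sketch that simulates a GARCH(1,1) sequence and reports its sample kurtosis (a normal distribution has kurtosis 3). The parameter values and the function name are assumptions chosen for the example.

```python
import math
import random


def garch_sample_kurtosis(n=200_000, omega=0.05, alpha=0.15, beta=0.80, seed=0):
    """Sample kurtosis of changes that are N(0, var_t) given the past."""
    rng = random.Random(seed)
    var = omega / (1.0 - alpha - beta)      # start at the unconditional variance
    sum2 = sum4 = 0.0
    for _ in range(n):
        x = math.sqrt(var) * rng.gauss(0.0, 1.0)   # conditionally normal change
        sum2 += x * x
        sum4 += x**4
        var = omega + alpha * x * x + beta * var   # GARCH(1,1) variance update
    return (sum4 / n) / (sum2 / n) ** 2


kurtosis = garch_sample_kurtosis()
print(kurtosis)
```

Each change is exactly normal given the current variance, yet the pooled sample mixes normals with different variances, which should push the kurtosis noticeably above 3.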


Delta-Gamma Approximation 


The assumption that V is linear in S holds, for example, for a stock portfolio 
if S is the vector of underlying stock prices. But a portfolio with options has 
a nonlinear dependence on the prices of underlying assets, and fixed-income 
securities depend nonlinearly on interest rates. The model in (9.1) is thus not 
universally applicable. 

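A simple way to extend (9.1) to capture some nonlinearity is to add a
quadratic term. The quadratic produced by Taylor expansion yields the delta-gamma
approximation

$$\Delta V \approx \frac{\partial V}{\partial t} \Delta t + \delta^\top \Delta S + \tfrac{1}{2} \Delta S^\top \Gamma \Delta S, \tag{9.2}$$

where

$$\delta_i = \frac{\partial V}{\partial S_i}, \qquad \Gamma_{ij} = \frac{\partial^2 V}{\partial S_i \, \partial S_j}$$

are first and second derivatives of $V$ evaluated at $(S(t), t)$. This in turn yields
a quadratic approximation to $L = -\Delta V$.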
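
The effect of the quadratic term in (9.2) can be seen in a one-asset example. The sketch below is an illustration only: it assumes a single Black-Scholes call (all parameter values and function names are hypothetical) and compares full revaluation with the delta-only and delta-gamma approximations for a given move $\Delta S$ over $\Delta t$.

```python
import math


def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def norm_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def bs_call(s, k, r, sigma, tau):
    """Black-Scholes call value with time to maturity tau."""
    d1 = (math.log(s / k) + (r + 0.5 * sigma**2) * tau) / (sigma * math.sqrt(tau))
    d2 = d1 - sigma * math.sqrt(tau)
    return s * norm_cdf(d1) - k * math.exp(-r * tau) * norm_cdf(d2)


def bs_call_greeks(s, k, r, sigma, tau):
    """Delta, gamma, and dV/dt (theta) of the call."""
    d1 = (math.log(s / k) + (r + 0.5 * sigma**2) * tau) / (sigma * math.sqrt(tau))
    d2 = d1 - sigma * math.sqrt(tau)
    delta = norm_cdf(d1)
    gamma = norm_pdf(d1) / (s * sigma * math.sqrt(tau))
    theta = (-s * norm_pdf(d1) * sigma / (2.0 * math.sqrt(tau))
             - r * k * math.exp(-r * tau) * norm_cdf(d2))
    return delta, gamma, theta


def compare(ds=5.0, dt=10.0 / 252.0, s=100.0, k=100.0, r=0.05, sigma=0.2, tau=0.5):
    delta, gamma, theta = bs_call_greeks(s, k, r, sigma, tau)
    full = bs_call(s + ds, k, r, sigma, tau - dt) - bs_call(s, k, r, sigma, tau)
    linear = theta * dt + delta * ds                 # delta-only change
    quadratic = linear + 0.5 * gamma * ds**2         # delta-gamma change (9.2)
    return full, linear, quadratic


full, linear, quadratic = compare()
print(full, linear, quadratic)
```

For a move of this size the delta-gamma value should be much closer to full revaluation than the linear one, though neither is exact.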

For this approximation to have practical value, the coefficients must be
easy to evaluate and finding the distribution of the approximation must be
substantially simpler than finding the distribution of $L$ itself. As discussed in
Chapter 7, calculating $\delta$ and $\Gamma$ can be difficult; however, these sensitivities
are routinely calculated for hedging purposes by individual trading desks and
can be aggregated (at the end of the day, for example) for calculation of
firmwide risk. This is a somewhat idealized description (for example, many
off-diagonal gammas may not be readily available) but is sufficiently close
to reality to provide a valid premise for analysis.

If $\Delta S$ has a multivariate normal distribution, then finding the distribution
of the approximation in (9.2) requires finding the distribution of a quadratic
function of normal random variables. This can be done numerically through
transform inversion. We detail the derivation of the transform because it is
relevant to the techniques we apply in Section 9.2.


Delta-Gamma: Diagonalization 


The first step derives a convenient expression for the approximation. As in
Section 2.3.3, we can replace the correlated normals $\Delta S \sim N(0, \Sigma_S)$ with
independent normals $Z \sim N(0, I)$ by setting

$$\Delta S = CZ \quad \text{with} \quad CC^\top = \Sigma_S.$$

In terms of $Z$, the quadratic approximation to $L = -\Delta V$ becomes

$$L \approx a - (C^\top \delta)^\top Z - \tfrac{1}{2} Z^\top (C^\top \Gamma C) Z \tag{9.3}$$

with $a = -(\Delta t)\, \partial V / \partial t$ deterministic.

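It is convenient to choose the matrix $C$ to diagonalize the quadratic term
in (9.3), and this can be accomplished as follows. Let $\tilde{C}$ be any square matrix
for which $\tilde{C}\tilde{C}^\top = \Sigma_S$, such as the one found by Cholesky factorization. The
matrix $-\tfrac{1}{2}\tilde{C}^\top \Gamma \tilde{C}$ is symmetric and thus admits the representation

$$-\tfrac{1}{2}\tilde{C}^\top \Gamma \tilde{C} = U \Lambda U^\top$$

in which

$$\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_m)$$

is a diagonal matrix and $U$ is an orthogonal matrix ($UU^\top = I$) whose columns
are eigenvectors of $-\tfrac{1}{2}\tilde{C}^\top \Gamma \tilde{C}$. The $\lambda_j$ are eigenvalues of this matrix and also
of $-\tfrac{1}{2}\Gamma \Sigma_S$. Now set $C = \tilde{C}U$ and observe that

$$CC^\top = \tilde{C}UU^\top \tilde{C}^\top = \Sigma_S$$

and

$$-\tfrac{1}{2}C^\top \Gamma C = -\tfrac{1}{2}U^\top (\tilde{C}^\top \Gamma \tilde{C})U = U^\top (U \Lambda U^\top)U = \Lambda.$$

Thus, by setting $b = -C^\top \delta$ we can rewrite (9.3) as

$$L \approx a + b^\top Z + Z^\top \Lambda Z = a + \sum_{j=1}^m (b_j Z_j + \lambda_j Z_j^2) \equiv Q. \tag{9.4}$$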


The delta-gamma approximation now becomes $P(L > x) \approx P(Q > x)$, so we
need to find the distribution of $Q$.
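
The diagonalization can be sketched numerically as follows. This is a minimal illustration using NumPy, not code from the text: the function name `delta_gamma_coefficients` and the argument `dv_dt` (standing for the time derivative of $V$) are hypothetical, and the sketch assumes that $\Gamma$ is symmetric and that $\Sigma_S$ is positive definite so that a Cholesky factor exists.

```python
import numpy as np


def delta_gamma_coefficients(delta, gamma, sigma_s, dv_dt=0.0, dt=0.0):
    """Return (a, b, lam, C) with L ~ Q = a + sum_j (b_j Z_j + lam_j Z_j^2), dS = C Z."""
    delta = np.asarray(delta, dtype=float)
    gamma = np.asarray(gamma, dtype=float)
    sigma_s = np.asarray(sigma_s, dtype=float)

    c_tilde = np.linalg.cholesky(sigma_s)        # C~ C~^T = Sigma_S
    sym = -0.5 * c_tilde.T @ gamma @ c_tilde     # symmetric matrix to diagonalize
    lam, u = np.linalg.eigh(sym)                 # sym = U diag(lam) U^T, U orthogonal
    c = c_tilde @ u                              # C = C~ U, so C C^T = Sigma_S
    b = -c.T @ delta                             # linear coefficients in (9.4)
    a = -dt * dv_dt                              # deterministic term
    return a, b, lam, c
```

Since `numpy.linalg.eigh` returns eigenvalues in ascending order, the largest $\lambda_j$ is the last entry of `lam`.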


Delta-Gamma: Moment Generating Function 


We determine the distribution of $Q$ by deriving its moment generating function
and characteristic function. The moment generating function is finite in a
neighborhood of the origin and, in light of the independence of the summands
in (9.4), factors as

$$E[e^{\theta Q}] = e^{a\theta} \prod_{j=1}^m E[e^{\theta(b_j Z_j + \lambda_j Z_j^2)}] \equiv e^{a\theta} \prod_{j=1}^m e^{\psi_j(\theta)}.$$

If $\lambda_j = 0$, then (2.26) yields $\psi_j(\theta) = b_j^2\theta^2/2$. Otherwise, we write

$$b_j Z_j + \lambda_j Z_j^2 = \lambda_j \left( Z_j + \frac{b_j}{2\lambda_j} \right)^2 - \frac{b_j^2}{4\lambda_j},$$

which is a linear transformation of a noncentral chi-square random variable;
see (3.72). Using the identity (equation (29.6) of Johnson, Kotz, and
Balakrishnan [202]),

$$E[\exp(\theta(Z_j + c)^2)] = (1 - 2\theta)^{-1/2} \exp\left( \frac{\theta c^2}{1 - 2\theta} \right)$$


for $\theta < 1/2$, we arrive at the expression

$$\psi(\theta) \equiv a\theta + \sum_{j=1}^m \psi_j(\theta) = a\theta + \frac{1}{2} \sum_{j=1}^m \left( \frac{\theta^2 b_j^2}{1 - 2\theta\lambda_j} - \log(1 - 2\theta\lambda_j) \right) \tag{9.5}$$

for $\log E[\exp(\theta Q)]$, the cumulant generating function of $Q$. This equation
holds for all $\theta$ satisfying $\max_j \theta\lambda_j < 1/2$.

The characteristic function of $Q$ is obtained by evaluating the moment
generating function at a purely imaginary argument:

$$\phi(u) = E[e^{iuQ}] = e^{\psi(iu)}, \qquad i = \sqrt{-1}.$$

The distribution of $Q$ can now be computed through the inversion integral
(Chung [85], p.153)

$$P(Q \le x) - P(Q \le x - y) = \frac{1}{\pi} \int_0^\infty \mathrm{Re}\left( \phi(u) \left[ \frac{e^{iuy} - 1}{iu} \right] e^{-iux} \right) du. \tag{9.6}$$

Abate, Choudhury, and Whitt [1] discuss algorithms for numerical evaluation
of integrals of this type. To find $P(Q \le x)$, choose $y$ large so that
$P(Q \le x - y) \approx 0$.
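
As a rough illustration of (9.5) and (9.6), the sketch below evaluates the cumulant generating function and approximates the inversion integral with a plain midpoint rule. It is not one of the specialized algorithms of Abate, Choudhury, and Whitt; the function names, the truncation point `u_max`, the grid size `n`, and the default choice of `y` (20 standard deviations of $Q$) are all assumptions made for this example and would need tuning to the scale of a real portfolio.

```python
import cmath
import math


def cgf(theta, a, b, lam):
    """Cumulant generating function psi(theta) of Q in (9.5).

    theta may be real or complex; the value is returned as a complex number.
    """
    total = complex(a * theta)
    for bj, lj in zip(b, lam):
        d = 1.0 - 2.0 * theta * lj
        total += 0.5 * (theta * theta * bj * bj / d - cmath.log(d))
    return total


def cdf_q(x, a, b, lam, y=None, u_max=200.0, n=200_000):
    """Approximate P(Q <= x) from (9.6): truncate at u_max, midpoint rule with n cells."""
    if y is None:
        # Assumed default: step back 20 standard deviations of Q so that
        # P(Q <= x - y) is negligible; Var[Q] = sum_j (b_j^2 + 2 lam_j^2).
        y = 20.0 * math.sqrt(sum(bj * bj + 2.0 * lj * lj for bj, lj in zip(b, lam)))
    du = u_max / n
    total = 0.0
    for k in range(n):
        u = (k + 0.5) * du                          # midpoints avoid u = 0
        phi = cmath.exp(cgf(1j * u, a, b, lam))     # phi(u) = exp(psi(iu))
        term = phi * (cmath.exp(1j * u * y) - 1.0) / (1j * u) * cmath.exp(-1j * u * x)
        total += term.real
    return total * du / math.pi
```

The grid spacing should be small relative to $1/y$, because the integrand oscillates at frequency $y$; this is one reason a production implementation would prefer a purpose-built inversion routine.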

Although the derivation of this method involves several steps, its
implementation is relatively straightforward and fast. It is developed in this
form by Rouvinez [310]. Studer [338] suggests replacing the ordinary Taylor
expansion in (9.2) with an Itô-Taylor expansion (of the type in Section 6.3.1),
leading to a slightly different quadratic. Britten-Jones and Schaefer [61] use
approximations to the distribution of quadratic functions of normals in place
of transform inversion. Extensions to non-normal risk factors are derived in
Duffie and Pan [104] and (as discussed in Section 9.3.2) Glasserman,
Heidelberger, and Shahabuddin [144].

Figure 9.1 illustrates the delta-gamma approximation for a simple portfo- 
lio. The portfolio consists of short positions of 10 calls and 5 puts on each of 
10 underlying assets, with all options 0.10 years from expiration. We value the 
portfolio using the Black-Scholes formula for each option, and the quadratic 
approximation uses Black-Scholes deltas and gammas. The underlying assets 
are uncorrelated and each has a 0.40 volatility. 

The scatter plot in Figure 9.1 shows the results of 1000 randomly generated
scenarios $\Delta S$. These scenarios are sampled from a multivariate normal
distribution with independent components $\Delta S_j$, $j = 1, \ldots, m$. Each
$\Delta S_j$ has mean 0 and standard deviation $S_j(0)\sigma_j\sqrt{\Delta t}$, with
$\Delta t = 0.04$ years or about two weeks. For each $\Delta S$, we revalue the
portfolio at time $t + \Delta t$ and underlying prices $S + \Delta S$ using the
Black-Scholes formula; this gives the horizontal coordinate for each scenario.
We also compute the quadratic approximation by substituting $\Delta S$ and
$\Delta t$ in the delta-gamma formula (9.2); this gives the vertical coordinate
for each scenario. The scatter plot illustrates the strong relation between the
exact and approximate losses in this example.

There is an evident inconsistency in this example between the model used
to generate scenarios and the formula used to value the portfolio. This is
representative of the standard practice of using fairly rough models to
describe market risk while using more detailed models to price derivatives. It
should also be noted that, in theory, $\Delta S$ should be sampled from the
objective probability measure describing actual market movements, whereas
pricing formulas ordinarily depend on the risk-neutral dynamics of underlying
assets, so some inconsistency in how we model $S$ for the two steps is
appropriate. Most importantly, the use of the Black-Scholes formula in this
example is convenient but by no means essential.

Our main interest in the delta-gamma approximation lies in accelerating
Monte Carlo simulation. Even in cases in which the method may not by itself
provide an accurate approximation to the loss distribution, it can provide a
powerful tool for variance reduction.

[Figure 9.1: scatter plot; horizontal axis "Actual Portfolio Loss", vertical axis "Delta-Gamma Approximation".]

Fig. 9.1. Comparison of delta-gamma approximation and actual portfolio losses in
1000 randomly generated scenarios for a portfolio of 150 options.


Monte Carlo Simulation 


Estimating loss probabilities and VAR by simulation is simple in concept, as 
illustrated by the following algorithm: 


o For each of $n$ independent replications
- generate a vector of market moves $\Delta S$;
- revalue portfolio and compute loss $V(S, t) - V(S + \Delta S, t + \Delta t)$.

o Estimate $P(L > x)$ using

$$\frac{1}{n} \sum_{i=1}^n \mathbf{1}\{L_i > x\},$$

where $L_i$ is the loss on the $i$th replication.


The bottleneck in this algorithm is the portfolio revaluation step. For a 
large portfolio of complex derivative securities, each revaluation may require 
running thousands of numerical pricing routines. Individual pricing routines 
may involve numerical integration, solving partial differential equations, or 
even running a separate simulation (as alluded to in Example 1.1.3). This 
makes variance reduction essential to achieving accurate estimates with rea- 
sonable computational effort. We return to this topic in Section 9.2. 
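
The basic algorithm can be sketched in a few lines. The names below (`loss_probability`, `revalue`, `sample_move`) are hypothetical placeholders: `revalue` stands for whatever collection of pricing routines returns $V(S + \Delta S, t + \Delta t)$, and in practice it dominates the running time.

```python
import random


def loss_probability(current_value, revalue, sample_move, x, n, seed=0):
    """Plain Monte Carlo estimate of P(L > x).

    current_value : V(S, t), the portfolio value today
    revalue       : function mapping a market move dS to V(S + dS, t + dt)
    sample_move   : function mapping a random.Random instance to one draw of dS
    """
    rng = random.Random(seed)
    exceed = 0
    for _ in range(n):
        ds = sample_move(rng)                 # generate a vector of market moves
        loss = current_value - revalue(ds)    # revalue portfolio and compute loss
        if loss > x:
            exceed += 1
    return exceed / n                         # fraction of replications with L_i > x
```

For instance, with a toy one-factor "portfolio" worth `100 + dS` and `dS` standard normal, `loss_probability(100.0, lambda ds: 100.0 + ds, lambda rng: rng.gauss(0.0, 1.0), 1.0, 100_000)` estimates $P(Z > 1)$.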

“Historical simulation” is a special case of this algorithm in which the
scenarios $\Delta S$ are drawn directly from historical data (daily price changes
over the past year, for example). Past observations of changes $\Delta S$ in the
underlying risk factors are applied to the current portfolio to produce a
histogram of changes in portfolio value. This approach is sometimes defended on
the grounds that it is easy to explain to a nontechnical audience; but the
historical distribution of $\Delta S$ inevitably has gaps, especially in the tails.
Fitting a theoretical distribution (or at least smoothing the tails) and then
applying Monte Carlo gets around this artificial feature of the purely
data-driven implementation.


Quantile Estimation 


The simulation algorithm above estimates loss probabilities $P(L > x)$ rather
than VAR and it is in this context that we discuss variance reduction
techniques. Before doing so, we briefly discuss the estimation of VAR itself.

Let $F_{L,n}$ denote the empirical distribution of portfolio losses based on $n$
simulated replications,

$$F_{L,n}(x) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}\{L_i \le x\}.$$

A simple estimate of the VAR at probability $p$ (e.g., $p = 0.01$) is the
empirical quantile

$$\hat{x}_p = F_{L,n}^{-1}(1 - p),$$

with the inverse of the piecewise constant function $F_{L,n}$ defined as in (2.14).
Applying piecewise linear interpolation to $F_{L,n}$ before taking the inverse
generally produces more accurate quantile estimates. (See Avramidis and Wilson
[30] for a comparison of simulation-based quantile estimators, including
several using antithetic and Latin hypercube sampling.)
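
A minimal sketch of the empirical quantile follows. It assumes that the inverse in (2.14) is the usual generalized inverse $\inf\{x : F(x) \ge u\}$, and the function name `empirical_var` is hypothetical. No interpolation is applied.

```python
import math


def empirical_var(losses, p):
    """Empirical quantile: smallest order statistic L_(k) with k/n >= 1 - p."""
    ordered = sorted(losses)
    n = len(ordered)
    # Smallest k with k/n >= 1 - p; the small offset guards against
    # floating-point round-off pushing n * (1 - p) just above an integer.
    k = math.ceil(n * (1.0 - p) - 1e-9)
    k = min(max(k, 1), n)
    return ordered[k - 1]
```

With losses `1, 2, ..., 100` and `p = 0.01` this returns 99, the smallest value at which the empirical distribution reaches 0.99.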

Under minimal conditions (as in Serfling [326], p.75), the empirical quantile
$\hat{x}_p$ converges to the true quantile $x_p$ with probability 1 as $n \to \infty$. A
central limit theorem provides additional information on the quality of
convergence. For this we assume that $L$ has a strictly positive density $f$ in a
neighborhood of $x_p$. Then

$$\sqrt{n}\,(\hat{x}_p - x_p) \Rightarrow \frac{\sqrt{p(1 - p)}}{f(x_p)}\, N(0, 1), \tag{9.7}$$

as shown, for example, in Serfling [326], p.77. The term $p(1 - p)$ is the
variance of the loss indicator $\mathbf{1}\{L > x_p\}$. We see from (9.7) that this
variance is magnified by a factor of $1/f(x_p)^2$ when we estimate the quantile
rather than the loss probability. This factor is potentially very large,
especially if $p$ is small, because the density is likely to be close to zero in
this case. We will see that by reducing variance in estimates of $P(L > x)$ for
$x$ near $x_p$, we can reduce the variance in this central limit theorem for the
VAR estimate.

Equation (9.7) provides the basis for a large-sample $1 - \alpha$ confidence
interval for $x_p$ of the form

$$\hat{x}_p \pm z_{\alpha/2} \frac{\sqrt{p(1 - p)}}{f(x_p)\sqrt{n}},$$

with $1 - \Phi(z_{\alpha/2}) = \alpha/2$. The interval remains asymptotically valid with
$f(x_p)$ replaced by $f(\hat{x}_p)$ if $f$ is continuous at $x_p$. However, its reliance
on evaluating the density $f$ makes this interval estimate impractical. To avoid
this step, one may divide the sample of $n$ observations into batches, compute an
estimate $\hat{x}_p$ from each batch, and form a confidence interval based on the
sample standard deviation of the estimates across batches.

An alternative confidence interval not relying on a central limit theorem
and valid for finite $n$ uses the fact that the number of samples exceeding $x_p$
has a binomial distribution with parameters $n$ and $p$. Let

$$L_{(1)} \le L_{(2)} \le \cdots \le L_{(n)} \tag{9.8}$$

denote the order statistics of the $L_i$. An interval of the form
$[L_{(r)}, L_{(s)})$, $r < s$, covers $x_p$ with probability

$$P(L_{(r)} \le x_p < L_{(s)}) = \sum_{i=r}^{s-1} \binom{n}{i} (1 - p)^i p^{n-i}.$$

The values $r$ and $s$ can be chosen to bring this probability close to the desired
confidence level $1 - \alpha$. These values do not depend on the loss distribution
$F_L$.
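
The coverage probability is easy to evaluate directly, as in the sketch below. The text does not prescribe how to search for $r$ and $s$; the symmetric widening around $n(1 - p)$ used in `order_statistic_interval` is only one simple heuristic, assumed here for illustration, and both function names are hypothetical.

```python
import math


def coverage(n, p, r, s):
    """P(L_(r) <= x_p < L_(s)) = sum_{i=r}^{s-1} C(n, i) (1 - p)^i p^(n - i), 0 < p < 1."""
    q = 1.0 - p
    total = 0.0
    for i in range(r, s):
        # Work on the log scale so large binomial coefficients do not overflow.
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                   + i * math.log(q) + (n - i) * math.log(p))
        total += math.exp(log_pmf)
    return total


def order_statistic_interval(n, p, alpha):
    """Widen [r, s) symmetrically around n(1 - p) until coverage >= 1 - alpha."""
    center = min(max(round(n * (1.0 - p)), 1), n - 1)
    r, s = center, center + 1
    while coverage(n, p, r, s) < 1.0 - alpha and (r > 1 or s < n):
        r = max(r - 1, 1)
        s = min(s + 1, n)
    return r, s, coverage(n, p, r, s)
```

The returned indices are 1-based, matching (9.8); the interval is then read off the sorted losses as $[L_{(r)}, L_{(s)})$. If the loop runs out of room before reaching the target, the returned coverage shows the shortfall.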


Approximate Simulation 


We noted previously that the bottleneck in using Monte Carlo simulation for 
estimating loss probabilities and VAR lies in portfolio revaluation. It follows 
that there are two basic strategies for accelerating simulation: 


o reduce the number of scenarios required to achieve a target precision by 
applying a variance reduction technique; or 

o reduce the time required for each scenario through approximate portfolio 
revaluation. 


Our focus in the next section is on a specific set of techniques for the first 
strategy. The second strategy is also promising but has received less systematic 
study to date. 

A general approach to approximate revaluation undertakes exact valuation
in a moderate number of scenarios (generated randomly or deterministically)
and fits a function to the value surface using some form of interpolation or
nonlinear regression. If the fitted approximation is easy to evaluate, then using
it in place of $V(S + \Delta S, t + \Delta t)$ in the simulation algorithm makes it
feasible to generate a much larger number of scenarios and compute (approximate)
losses in each. The challenge in this approach lies in selecting an appropriate
functional form and, especially for high-dimensional $S$, computing enough
exact values to obtain a good fit.

The problem simplifies if the portfolio value is “separable” in the sense
that each instrument in the portfolio depends on only a small number of risk
factors, though the total number of risk factors may be large. In the extreme
case of total separation, the portfolio value can be expressed as

$$V(S) = V_1(S_1) + \cdots + V_m(S_m),$$

where $V_j$ gives the value of all instruments sensitive to $S_j$ (and only to $S_j$).

Suppose evaluating the functions $V_j$ is time-consuming. Evaluating each
$V_j$ at just two values of $S_j$ yields the value of $V$ at $2^m$ points if we take
all vectors $(S_1, \ldots, S_m)$ formed by combinations of the pairs of $S_j$ values.
Similar though less dramatic savings apply if $V$ decomposes as a sum of
functions of, say, 2-5 arguments each if $m$ is much larger than 5.

Variants of this general approach are described by Jamshidian and Zhu 
[198], Picoult [297], and Shaw [329]. Jamshidian and Zhu [198] propose a 
partitioning and weighting of the set of possible S, using principal components 
analysis (keeping only the most important components) to reduce the number 
of risk factors. Abken [2] tests the method on portfolios of multicurrency 
interest rate derivatives and reports mixed results. 


9.2 Variance Reduction Using the Delta-Gamma 
Approximation 


We stressed in Chapter 4 that effective variance reduction takes advantage 
of special features of a simulated model. In simulating portfolio losses we 
should therefore look for information available that could be used to improve
precision. One source of information is the set of results of simulations of
the portfolio’s value in recent days under presumably similar market
conditions; this potentially effective direction has not received systematic study to
date. Another source of information for variance reduction is the quadratic 
delta-gamma approximation (9.2) to portfolio value (or the linear delta ap- 
proximation (9.1)), and this is the case we treat in detail. This strategy for 
variance reduction leads to interesting theoretical results (especially in impor- 
tance sampling) and effective variance reduction. The techniques we discuss 
are primarily from Glasserman, Heidelberger, and Shahabuddin (henceforth 
GHS) [143, 142, 144]. 

In order to use the delta-gamma approximation, we assume throughout
this section that the changes in risk factors $\Delta S$ are multivariate normal
$N(0, \Sigma_S)$. Section 9.3 develops extensions to a class of heavy-tailed
distributions. The techniques discussed in this section are applicable with any
quadratic approximation and thus could in principle be used with a hedged
portfolio in which all deltas and gammas are zero, provided a less local
quadratic approximation could be computed. As discussed in Section 9.1.2,
the delta-gamma approximation has the advantage that it may be available
with little or no additional computational effort.


9.2.1 Control Variate


The simplest way to take advantage of the delta-gamma approximation $L \approx Q$
applies it as a control variate. On each replication (meaning for each simulated
market move $\Delta S$), evaluate $Q$ along with the portfolio loss $L$. Using the
representation in (9.4) is convenient if we generate $\Delta S$ as $CZ$ with
$Z \sim N(0, I)$ and $C$ the matrix constructed in the derivation of (9.4). Let
$(L_i, Q_i)$, $i = 1, \ldots, n$, be the values recorded on $n$ independent
replications. A control variate estimator of $P(L > x)$ is given by

$$1 - \hat{F}_L^{cv}(x) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}\{L_i > x\} - \hat{\beta} \left( \frac{1}{n} \sum_{i=1}^n \mathbf{1}\{Q_i > y\} - P(Q > y) \right). \tag{9.9}$$

An estimate $\hat{\beta}$ of the variance-minimizing coefficient can be computed from
the $(L_i, Q_i)$ as explained in Section 4.1.1.

As suggested by the notation in (9.9), the threshold y used for Q need not 
be the same as the one applied to L, though it often would be in practice. 
We could construct multiple controls by using multiple thresholds for Q. The 
exact probability P(Q > y) needed for the control is evaluated using the 
inversion integral in (9.6). 
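
The estimator (9.9) can be sketched as below. The function name `control_variate_tail` is hypothetical, the coefficient is the ordinary least-squares slope of one indicator on the other (the usual sample estimate of the variance-minimizing coefficient), and the exact probability $P(Q > y)$ is passed in as `p_q` on the assumption that it has been computed separately, for example by transform inversion.

```python
def control_variate_tail(losses, approx, x, y, p_q):
    """Control variate estimate (9.9) of P(L > x), with control 1{Q > y} of known mean p_q."""
    n = len(losses)
    ind_l = [1.0 if l > x else 0.0 for l in losses]
    ind_q = [1.0 if q > y else 0.0 for q in approx]
    mean_l = sum(ind_l) / n
    mean_q = sum(ind_q) / n
    s_qq = sum((c - mean_q) ** 2 for c in ind_q)
    if s_qq == 0.0:
        # The control is constant in this sample, so no adjustment is possible.
        return mean_l
    s_lq = sum((l - mean_l) * (c - mean_q) for l, c in zip(ind_l, ind_q))
    beta_hat = s_lq / s_qq                      # estimated optimal coefficient
    return mean_l - beta_hat * (mean_q - p_q)
```

The fallback when no (or every) sampled $Q_i$ exceeds $y$ is a design choice of this sketch: the estimator simply reverts to the plain average of the loss indicators.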

The control variate estimator (9.9) is easy to implement and numerical 
experiments reported in Cardenas et al. [76] and GHS [142] indicate that it 
often reduces variance by a factor of 2-5. However, the estimator suffers from 
two related shortcomings: 


o Even if $L$ and $Q$ are highly correlated, the loss indicators $\mathbf{1}\{L > x\}$ and
$\mathbf{1}\{Q > y\}$ may not be, especially for large $x$ and $y$. This is relevant
because the fraction of variance removed is (at best) $\rho^2$, with $\rho$ the
correlation between the indicators; see the discussion in Section 4.1.1.

o If $x$ and $y$ are large, few simulations produce values of $L$ or $Q$ greater than
these thresholds. This makes it difficult to estimate the optimal coefficient
$\beta$ and thus further erodes the variance reduction achieved.


Both of these points are illustrated in Figure 9.1. Although the correla- 
tion between exact and approximate losses is very high in this example, it 
is weakest in the upper-right corner, the area of greatest interest, and few 
observations fall in this area. 

An indirect illustration of the first point above is provided by the normal
distribution. If a pair of random variables $(X, Y)$ has a bivariate normal
distribution with correlation $\rho$, then the correlation of the indicators
$\mathbf{1}\{X > u\}$ and $\mathbf{1}\{Y > u\}$ approaches zero as $u \to \infty$ if
$|\rho| \ne 1$; see Embrechts et al. [112].


Quantile Estimation 


Using the control variate to estimate a quantile $x_p$ (at which $P(L > x_p) = p$)
requires inverting $\hat{F}_L^{cv}$ in (9.9) to find a point $\hat{x}_p$ at which
$\hat{F}_L^{cv}(\hat{x}_p) \approx 1 - p$. This appears to require recomputing the
coefficient $\hat{\beta}$ for multiple values of $x$. But Hesterberg and Nelson [178]
show that this can be avoided, as we now explain.

Their method uses the connection between control variate estimators and
weighted Monte Carlo discussed in Section 4.1.2. Using (4.19), the control
variate estimator (9.9) can be expressed as

$$1 - \hat{F}_L^{cv}(x) = \sum_{i=1}^n W_i \mathbf{1}\{L_i > x\},$$

where the weights $W_i$ depend on $Q_1, \ldots, Q_n$ and $y$ but not on
$L_1, \ldots, L_n$ or $x$. Thus, $\hat{F}_L^{cv}(x)$ can be evaluated at multiple values
of $x$ simply by summing over the appropriate set of weights, without
recalculation of $\hat{\beta}$.

Through this representation of the control variate estimator of the
distribution of $L$, the quantile estimator

$$\hat{x}_p = \inf\{x : 1 - \hat{F}_L^{cv}(x) \le p\}$$

becomes

$$\hat{x}_p = \inf\Big\{x : \sum_{i : L_i > x} W_i \le p\Big\}.$$

Let $L_{(i)}$, $i = 1, \ldots, n$, denote the order statistics of the $L_i$, as in (9.8),
and let $W_{(i)}$ denote the weight corresponding to $L_{(i)}$. Then
$\hat{x}_p = L_{(i_p)}$ where

$$i_p = \min\Big\{k : \sum_{i=k+1}^n W_{(i)} \le p\Big\}.$$

To smooth the estimator using linear interpolation, find $\alpha$ for which

$$\alpha \sum_{i=i_p}^n W_{(i)} + (1 - \alpha) \sum_{i=i_p+1}^n W_{(i)} = p$$

and then estimate the quantile as

$$\hat{x}_p = \alpha L_{(i_p - 1)} + (1 - \alpha) L_{(i_p)}.$$

This procedure is further simplified by the observation in Hesterberg and
Nelson [178] that for (9.9), the weights reduce to

$$W_i = \begin{cases} P(Q \le y) \big/ \sum_{j=1}^n \mathbf{1}\{Q_j \le y\}, & \text{if } Q_i \le y; \\ P(Q > y) \big/ \sum_{j=1}^n \mathbf{1}\{Q_j > y\}, & \text{if } Q_i > y. \end{cases}$$

This shows that the control variate estimator in (9.9) weights each indicator
$\mathbf{1}\{L_i > x\}$ by the ratio of an exact and estimated probability for $Q$. As
further observed by Hesterberg and Nelson [178], this makes it a poststratified
estimator of $P(L > x)$ with stratification variable $Q$ and strata $\{Q \le y\}$ and
$\{Q > y\}$. We will have more to say about using $Q$ as a stratification variable
in Section 9.2.3.
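
The weighted form of the estimator and the resulting quantile estimate can be sketched as follows. The function names are hypothetical, the exact probability $P(Q > y)$ is again assumed to be supplied as `p_q`, and the quantile routine implements only the unsmoothed estimator $L_{(i_p)}$, without the linear interpolation step.

```python
def poststratified_weights(approx, y, p_q):
    """Weights W_i for (9.9): exact stratum probability over observed stratum count."""
    n_hi = sum(1 for q in approx if q > y)
    n_lo = len(approx) - n_hi
    if n_hi == 0 or n_lo == 0:
        raise ValueError("both strata {Q <= y} and {Q > y} must contain samples")
    return [p_q / n_hi if q > y else (1.0 - p_q) / n_lo for q in approx]


def weighted_tail(losses, weights, x):
    """Weighted estimate of P(L > x): sum of W_i over replications with L_i > x."""
    return sum(w for l, w in zip(losses, weights) if l > x)


def weighted_quantile(losses, weights, p):
    """Return L_(i_p) with i_p = min{k : sum_{i > k} W_(i) <= p} (nonnegative weights)."""
    pairs = sorted(zip(losses, weights))
    k = len(pairs)      # with k = n the tail sum is empty, so it is <= p
    tail = 0.0
    # Lower k while the tail sum that would result stays at or below p.
    while k > 1 and tail + pairs[k - 1][1] <= p:
        tail += pairs[k - 1][1]
        k -= 1
    return pairs[k - 1][0]
```

Because the weights do not depend on `x`, `weighted_tail` can be called at many thresholds after a single call to `poststratified_weights`, which is the point of the representation.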

9.2.2 Importance Sampling


The control variate estimator (9.9) uses knowledge of the delta-gamma
approximation $Q$ after simulating replications of $\Delta S$ in order to adjust the
average of the loss indicators $\mathbf{1}\{L_i > x\}$. But as already noted, if we
observe few replications in which $Q_i > y$, we have little information on which
to base the adjustment. This is evident in the connection with
poststratification as well.

Through importance sampling, we can use the delta-gamma approximation
to guide the sampling of scenarios before we compute losses, rather than to
adjust our estimate after the scenarios are generated. In particular, we can
try to use our knowledge of the distribution of $Q$ to give greater probability
to “important” scenarios in which $L$ exceeds $x$. This idea is developed in GHS
[142, 143, 144].

Before proceeding with the details of this approach, consider the
qualitative information provided by the approximation

$$L \approx Q = a + \sum_{j=1}^m b_j Z_j + \sum_{j=1}^m \lambda_j Z_j^2,$$

with the $Z_j$ independent standard normals, as in (9.4). The $Z_j$ do not
correspond directly to the original changes in market prices $\Delta S_j$, but we can
think of them as primitive sources of risk that drive the market changes through
the relation $\Delta S = CZ$. Also, since $Z = C^{-1}\Delta S$, each $Z_j$ could be
interpreted as the change in value of a portfolio of linear positions in the
elements of $S$.

What does this approximation to L say about how large losses occur? It 
suggests that large losses occur when 


(i) $Z_j$ is large and positive for some $j$ with $b_j > 0$;

(ii) $Z_j$ is large and negative for some $j$ with $b_j < 0$; or

(iii) $Z_j^2$ is large and positive for some $j$ with $\lambda_j > 0$.

This further suggests that to make large losses more likely, we should

(i′) give a positive mean to those $Z_j$ for which $b_j > 0$;
(ii′) give a negative mean to those $Z_j$ for which $b_j < 0$; and
(iii′) increase the variance of those $Z_j$ for which $\lambda_j > 0$.

Each of the changes (i′)-(iii′) increases the probability of the corresponding
event (i)-(iii). We develop an importance sampling procedure that makes these
qualitative changes precise.


Exponential Twisting 


In our discussion of importance sampling in Section 4.6, we saw examples in
which an exponential change of measure leads to dramatic variance reduction.
If our goal were to estimate $P(Q > x)$ rather than $P(L > x)$, this would lead
us to consider importance sampling based on “exponentially twisting” $Q$. In
more detail (cf. Example 4.6.2), this means defining a family of probability
measures $P_\theta$ through the likelihood ratio

$$\frac{dP_\theta}{dP} = e^{\theta Q - \psi(\theta)}, \tag{9.10}$$

with $\psi$ the cumulant generating function in (9.5) and $\theta$ any real number at
which $\psi(\theta) < \infty$. Writing $E_\theta$ for expectation under the new measure, we have

$$P(Q > x) = E_\theta\left[ \frac{dP}{dP_\theta} \mathbf{1}\{Q > x\} \right] = E_\theta\left[ e^{-\theta Q + \psi(\theta)} \mathbf{1}\{Q > x\} \right].$$

To use this in simulation, we would need to generate $Q$ from its distribution
under $P_\theta$ and then use the average of independent replications of

$$e^{-\theta Q + \psi(\theta)} \mathbf{1}\{Q > x\}$$

to estimate $P(Q > x)$.

If $\theta$ is positive, then $P_\theta$ gives greater probability to large values of $Q$ than
$P = P_0$ does, thus increasing the probability of the event $\{Q > x\}$ for large
$x$, which is intuitively appealing. More precisely, the second moment of this
importance sampling estimator is

$$E_\theta\left[ e^{-2\theta Q + 2\psi(\theta)} \mathbf{1}\{Q > x\} \right] = E\left[ e^{-\theta Q + \psi(\theta)} \mathbf{1}\{Q > x\} \right] \le e^{-\theta x + \psi(\theta)}, \tag{9.11}$$

which decreases exponentially in $x$ if $\theta$ is positive.
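
This idea extends to the more relevant problem of estimating $P(L > x)$.
Using $P_\theta$ as defined in (9.10), we have

$$P(L > x) = E_\theta\left[ e^{-\theta Q + \psi(\theta)} \mathbf{1}\{L > x\} \right];$$

the expression inside the expectation on the right is an unbiased importance
sampling estimator of the loss probability. Its second moment is

$$E_\theta\left[ e^{-2\theta Q + 2\psi(\theta)} \mathbf{1}\{L > x\} \right] = E\left[ e^{-\theta Q + \psi(\theta)} \mathbf{1}\{L > x\} \right],$$

which is small if $\theta > 0$ and $Q$ is large on the event $\{L > x\}$.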


Sampling from the Twisted Distribution 


To use this estimator, we need to be able to generate independent replications
of

$$e^{-\theta Q + \psi(\theta)} \mathbf{1}\{L > x\}$$

under the $\theta$-twisted distribution. In other words, we need to be able to
simulate the pair $(Q, L)$ under $P_\theta$. But recall from the generic simulation
algorithm of Section 9.1.2 that, even in the absence of importance sampling, we
do not generate $(Q, L)$ directly; instead we generate $Z$ and then evaluate $Q$ and
$L$ from $Z$ (through $\Delta S = CZ$). Thus, $(Q, L) = f(Z)$ for some deterministic
function $f$. Sampling $Z$ from the standard multivariate normal distribution
gives $(Q, L)$ its distribution under the original probability measure $P$; to
simulate $(Q, L)$ under $P_\theta$, it suffices to sample $Z$ from its distribution under
the new measure.
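
What is the distribution of $Z$ under $P_\theta$? In other words, what is

$$P_\theta(Z_1 \le z_1, \ldots, Z_m \le z_m) = E\left[ e^{\theta Q - \psi(\theta)} \mathbf{1}\{Z_1 \le z_1, \ldots, Z_m \le z_m\} \right]?$$

It is shown in GHS [143] that, rather remarkably, the $P_\theta$-distribution of
$Z$ is still multivariate normal, but with a new mean vector and covariance
matrix determined by the delta-gamma approximation. In particular,
$Z \sim N(\mu(\theta), \Sigma(\theta))$ where $\Sigma(\theta)$ is a diagonal matrix with diagonal
entries $\sigma_j^2(\theta)$, and

$$\mu_j(\theta) = \frac{\theta b_j}{1 - 2\lambda_j\theta}, \qquad \sigma_j^2(\theta) = \frac{1}{1 - 2\lambda_j\theta}. \tag{9.12}$$

Recall that in (9.5) we required $2\theta\lambda_j < 1$ so that $\psi(\theta) < \infty$.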

To show that this is indeed the $P_\theta$-distribution of $Z$, it is easiest to argue
in the opposite direction. For arbitrary $\mu$ and $\Sigma$, the likelihood ratio relating
the density of $N(\mu, \Sigma)$ to $N(0, I)$ is given by

$$\frac{|\Sigma|^{-1/2} \exp\left( -\tfrac{1}{2}(Z - \mu)^\top \Sigma^{-1}(Z - \mu) \right)}{\exp\left( -\tfrac{1}{2} Z^\top Z \right)}, \tag{9.13}$$

the ratio of the two densities evaluated at $Z$. Substituting the specific
parameters (9.12) and applying some simplifying algebra shows that this reduces
to $\exp(\theta Q - \psi(\theta))$, with $Q$ related to $Z$ through (9.4). Because this is the
likelihood ratio used to define $P_\theta$, we may indeed conclude that under $P_\theta$ the
$Z_j$ are independent normals with the means and variances in (9.12). If $\mu$ and
$\Sigma$ are chosen arbitrarily, (9.13) depends on the entire vector $Z$. For the
special case of (9.12), the dependence of the likelihood ratio on $Z$ collapses to
dependence on $Q$. This is a key feature of this importance sampling strategy.

The new means and variances defined by (9.12) have the qualitative
features (i′)-(iii′) listed above for any $\theta > 0$, so (9.10) is consistent with
these insights into how the delta-gamma approximation should guide the sampling
of scenarios. If all the $\lambda_j$ are close to zero, we might use just the linear
term in $Q$ (a delta-only approximation) for importance sampling. The parameters
in (9.12) would then reduce to a change of mean for the vector $Z$ in a manner
consistent with features (i′) and (ii′).
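
The "simplifying algebra" can be checked numerically. The sketch below computes the twisted means and variances of (9.12) and compares the logarithm of the density ratio (9.13), specialized to a diagonal covariance matrix, with $\theta Q - \psi(\theta)$ at an arbitrary point. All function names and the numerical values in the usage note are illustrative assumptions.

```python
import math


def psi(theta, a, b, lam):
    """Cumulant generating function (9.5) for real theta with 2*theta*lam_j < 1."""
    total = a * theta
    for bj, lj in zip(b, lam):
        d = 1.0 - 2.0 * theta * lj
        total += 0.5 * (theta * theta * bj * bj / d - math.log(d))
    return total


def twisted_parameters(theta, b, lam):
    """Means mu_j(theta) and variances sigma_j^2(theta) from (9.12)."""
    mu, var = [], []
    for bj, lj in zip(b, lam):
        d = 1.0 - 2.0 * lj * theta
        if d <= 0.0:
            raise ValueError("theta must satisfy 2 * theta * lambda_j < 1")
        mu.append(theta * bj / d)
        var.append(1.0 / d)
    return mu, var


def log_density_ratio(z, mu, var):
    """Log of (9.13) when Sigma is diagonal with entries var."""
    return sum(-0.5 * math.log(v) - 0.5 * (zj - m) ** 2 / v + 0.5 * zj * zj
               for zj, m, v in zip(z, mu, var))


def quadratic(z, a, b, lam):
    """Q = a + sum_j (b_j z_j + lam_j z_j^2), as in (9.4)."""
    return a + sum(bj * zj + lj * zj * zj for bj, lj, zj in zip(b, lam, z))
```

For example, with `a = 0.3`, `b = [1.0, -2.0, 0.5]`, `lam = [0.2, -0.4, 0.0]`, `theta = 0.7` and any vector `z`, the value of `log_density_ratio(z, *twisted_parameters(theta, b, lam))` should agree with `theta * quadratic(z, a, b, lam) - psi(theta, a, b, lam)` up to round-off.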


Importance Sampling Algorithm 


We can now summarize the ideas above in the following algorithm for
estimating portfolio loss probabilities through importance sampling:

1. Choose a value of $\theta > 0$ at which $\psi(\theta) < \infty$. (More on this shortly.)
2. For each of $n$ replications
(a) generate $Z$ from $N(\mu(\theta), \Sigma(\theta))$ with parameters given in (9.12);
(b) evaluate $Q$ from $Z$ using (9.4);
(c) set $\Delta S = CZ$;
(d) calculate portfolio value $V(S + \Delta S, t + \Delta t)$ and loss $L$;
(e) calculate

$$e^{-\theta Q + \psi(\theta)} \mathbf{1}\{L > x\}. \tag{9.14}$$

3. Calculate average of (9.14) over the $n$ replications.


Rather little in this algorithm differs from the standard Monte Carlo
algorithm given in Section 9.1.2. In particular, the core of the algorithm
(invoking all the routines necessary to revalue the portfolio in step 2(d))
is unchanged. Step 2(a) is only slightly more complicated than generating $Z$
from $N(0, I)$. Step 2(c) would be needed even without importance sampling,
though here we require $C$ to be the specific matrix constructed in deriving
(9.4) rather than an arbitrary matrix for which $CC^\top = \Sigma_S$.
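
The steps of the algorithm can be sketched as follows. To keep the example self-contained, steps 2(c) and 2(d) are folded into a user-supplied function `loss_from_z`, a hypothetical placeholder that would set $\Delta S = CZ$ and call the portfolio revaluation routines; the other names are likewise illustrative.

```python
import math
import random


def psi(theta, a, b, lam):
    """Cumulant generating function (9.5) for real theta with 2*theta*lam_j < 1."""
    total = a * theta
    for bj, lj in zip(b, lam):
        d = 1.0 - 2.0 * theta * lj
        total += 0.5 * (theta * theta * bj * bj / d - math.log(d))
    return total


def is_loss_probability(loss_from_z, a, b, lam, theta, x, n, seed=0):
    """Importance sampling estimate of P(L > x) by exponentially twisting Q."""
    rng = random.Random(seed)
    d = [1.0 - 2.0 * theta * lj for lj in lam]
    if min(d) <= 0.0:
        raise ValueError("theta must satisfy 2 * theta * lambda_j < 1 for all j")
    mu = [theta * bj / dj for bj, dj in zip(b, d)]       # means in (9.12)
    sd = [1.0 / math.sqrt(dj) for dj in d]               # standard deviations in (9.12)
    psi_theta = psi(theta, a, b, lam)
    total = 0.0
    for _ in range(n):
        z = [rng.gauss(m, s) for m, s in zip(mu, sd)]                          # 2(a)
        q = a + sum(bj * zj + lj * zj * zj for bj, lj, zj in zip(b, lam, z))   # 2(b)
        loss = loss_from_z(z)                                                  # 2(c)-(d)
        if loss > x:
            total += math.exp(-theta * q + psi_theta)                          # 2(e)
    return total / n                                                           # step 3
```

In the special case `loss_from_z = lambda z: z[0]` with `a = 0`, `b = [1.0]`, `lam = [0.0]`, the loss coincides with $Q = Z_1$, so the estimate can be compared against the exact normal tail probability.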


Choice of Twisting Parameter 


Step 1 of the importance sampling algorithm requires us to choose $\theta$, so we
now discuss the selection of this parameter. In the absence of additional
information, we choose a value of $\theta$ that would be effective in estimating
$P(Q > x)$ and apply this value in estimating $P(L > x)$.

In (9.11) we have an upper bound on the second moment of the importance
sampling estimator of $P(Q > x)$ for all $x$ and all $\theta$ at which $\psi(\theta) < \infty$. We
can minimize this upper bound for fixed $x$ by choosing $\theta$ to minimize
$\psi(\theta) - \theta x$. The fact that $\psi$ is convex (because it is a cumulant generating
function) implies that this expression is minimized at $\theta_x$, the root of the
equation

$$\psi'(\theta_x) = x. \tag{9.15}$$

This equation is illustrated in Figure 9.2.

In addition to minimizing an upper bound on the second moment of the
estimator, the parameter $\theta_x$ has an interpretation that sheds light on this
approach to importance sampling. Equation (9.10) defines an exponential family
of probability measures, a concept we discussed in Section 4.6.1, especially in
Example 4.6.2. As a consequence, we have

$$\psi'(\theta) = E_\theta[Q] \tag{9.16}$$

for any $\theta$ at which $\psi(\theta)$ is finite. This follows from differentiating the
definition $\psi(\theta) = \log E[\exp(\theta Q)]$ to get

$$\psi'(\theta) = \frac{E[Q e^{\theta Q}]}{E[e^{\theta Q}]} = E[Q e^{\theta Q - \psi(\theta)}] = E_\theta[Q].$$

Fig. 9.2. Twisting parameter $\theta$ satisfying $\psi'(\theta) = x$.

Choosing the parameter $\theta$ to use for importance sampling is thus equivalent to
choosing $E_\theta[Q]$, the expected value of the delta-gamma approximation under
the new distribution. By choosing $\theta = \theta_x$ as in (9.15), we are sampling from
a distribution under which

$$E_{\theta_x}[Q] = x.$$

Whereas $x$ was in the tail of the distribution under the original distribution,
it is near the center of the distribution when we apply importance sampling.
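
Because $\psi'$ is increasing, the root $\theta_x$ can be found by bisection. The sketch below differentiates (9.5) term by term, which gives $\psi'(\theta) = a + \sum_j [\theta b_j^2 (1 - \theta\lambda_j)/(1 - 2\theta\lambda_j)^2 + \lambda_j/(1 - 2\theta\lambda_j)]$; the function names, the bracketing strategy, and the iteration count are assumptions of this example rather than a prescribed method.

```python
import math


def psi_prime(theta, a, b, lam):
    """Derivative of the cumulant generating function (9.5); inf outside its domain."""
    total = a
    for bj, lj in zip(b, lam):
        d = 1.0 - 2.0 * theta * lj
        if d <= 0.0:
            return math.inf
        total += theta * bj * bj * (1.0 - theta * lj) / (d * d) + lj / d
    return total


def twisting_parameter(x, a, b, lam, iterations=200):
    """Solve psi'(theta_x) = x for theta_x > 0 by bisection (requires x > E[Q])."""
    if x <= psi_prime(0.0, a, b, lam):
        raise ValueError("x must exceed E[Q] = psi'(0) for a positive twisting parameter")
    lam_max = max(lam)
    lo = 0.0
    if lam_max > 0.0:
        hi = 0.5 / lam_max              # psi' increases to infinity as theta approaches hi
    else:
        hi = 1.0
        while psi_prime(hi, a, b, lam) < x:
            hi *= 2.0
            if hi > 1e12:
                raise ValueError("no root found; Q may be bounded above below x")
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if psi_prime(mid, a, b, lam) < x:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Two cases with known answers give a quick check: for $Q = Z$ the root is $\theta_x = x$, and for $Q = Z^2$ it is $\theta_x = (1 - 1/x)/2$.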


Asymptotic Optimality 


GHS [143] prove an asymptotic optimality result for this approach to impor- 
tance sampling when applied to estimation of P(Q > x). This should be in- 
terpreted as indirect evidence that the approach is also effective in estimating 
the actual loss probability P(L > x). Numerical examples in GHS [142, 143] 
applying the approach to estimating loss probabilities in test portfolios lend 
further support to the method. 

The asymptotic optimality result (from Theorems 1 and 2 of [143]) is
as follows. Suppose $\lambda_{\max} \equiv \max_{1 \le i \le m} \lambda_i$, the largest of the eigenvalues of
$-\Gamma\Sigma_S/2$ (as in (9.4)), is strictly positive; then the tail of $Q$ satisfies

$$P(Q > x) = \exp\left( -\frac{x}{2\lambda_{\max}} + o(x) \right)$$

and the second moment of the importance sampling estimator satisfies

$$E_{\theta_x}\left[ e^{-2\theta_x Q + 2\psi(\theta_x)} \mathbf{1}\{Q > x\} \right] = \exp\left( -\frac{x}{\lambda_{\max}} + o(x) \right).$$

Thus, as $x$ increases, the second moment decreases at twice the exponential
rate of the probability itself. As explained in Example 4.6.3, this is the fastest
possible rate of decrease of any unbiased estimator. In this sense, the
importance sampling estimator is asymptotically optimal.

Theorem 3 of GHS [143] establishes an asymptotic optimality result when
$\lambda_{\max} < 0$; this case is less interesting because it implies that $Q$ is bounded
above. Theorem 4 of GHS [143] analyzes the effect of letting the number of
risk factors $m$ increase and establishes a type of asymptotic optimality in this
case as well.

To estimate the tail $P(Q > y)$ at many values of $y$ (in order to estimate
a quantile, for example) one would want to use estimators of the form

$$e^{-\theta_x Q + \psi(\theta_x)} \mathbf{1}\{Q > y\},$$

rather than use a separate value of $\theta$ for each point $y$. Theorem 5 of GHS
[143] shows that this can be done without sacrificing asymptotic optimality.
More fundamentally, the effectiveness of the estimator is not very sensitive to
the value of $\theta$ used. In practice, to estimate loss probabilities over a range of
large values of $y$, it is advisable to choose $\theta = \theta_x$ for $x$ near the middle or
left endpoint of the range. At loss thresholds that are not “large” there is no
need to apply importance sampling.

Through a result of Glynn [154], variance reduction for tail probabilities 
using importance sampling translates to variance reduction for quantile esti- 
mates. In the central limit theorem (9.7) for quantile estimation, the factor 
p(1 —p) is the variance of the indicator function of the event that the quantile 
$x_p$ is exceeded. With importance sampling, this factor gets replaced by the
variance of the importance sampling estimator of the probability of this event. 
See Theorem 6 of GHS [144]. 


9.2.3 Stratified Sampling 


To further reduce variance, we now apply stratified sampling to the estimator 
(9.14) by stratifying $Q$. If the true loss $L$ were exactly equal to $Q$, this
would remove all variance, in the limit of infinitely fine stratification. (See 
the discussion following equation (4.46).) This makes the approach attractive 
even if Q is only an approximation to L. 

Stratifying Q in (9.14) is a special case of a more general variance reduction 
strategy of applying importance sampling by exponentially twisting a random 
variable and then stratifying that random variable. This is then equivalent to 
stratifying the likelihood ratio and thus to removing variance that may have 
been introduced through the likelihood ratio. We encountered this strategy in 
Section 4.6.2 where we applied importance sampling to add a mean vector $\mu$
to a normal random vector $Z$ and then stratified $\mu^\top Z$.

Recall from our general discussion in Section 4.3 that to apply stratified
sampling to the pair $(Q, L)$ with stratification variable $Q$, we need to address
two issues:

o We need to find intervals $A_k$, $k = 1, \ldots, K$, of known (and perhaps of
equal) probability for $Q$; these are the strata.

o We need a mechanism for sampling $(Q, L)$ conditional on $Q$ falling in a
particular stratum.


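Once we have strata partitioning the real line, we use the decomposition

$$E_\theta\left[ e^{-\theta Q + \psi(\theta)} \mathbf{1}\{L > x\} \right] = \sum_{k=1}^K P_\theta(Q \in A_k)\, E_\theta\left[ e^{-\theta Q + \psi(\theta)} \mathbf{1}\{L > x\} \mid Q \in A_k \right]. \tag{9.17}$$

We estimate each conditional expectation by sampling $(Q, L)$ conditional on
$Q \in A_k$ and combine these estimates with the known stratum probabilities
$P_\theta(Q \in A_k)$.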


Defining Strata 


As explained in Example 4.3.2, defining strata for a random variable is in
principle straightforward given its cumulative distribution function. In the case of
Q, the cumulative distribution is available through transform inversion. However,
some modification of the inversion integral (9.6) is necessary because for
(9.17) we need the distribution of Q under the importance sampling measure
$P_\theta$.

Conveniently, Q remains a quadratic function of normal random variables
under $P_\theta$ because each $Z_j$ remains normal under $P_\theta$. Using the
parameters in (9.12), we find that

\[
Q = a + \sum_{j=1}^{m} (b_j Z_j + \lambda_j Z_j^2)
  = \tilde a + \sum_{j=1}^{m} (\tilde b_j \tilde Z_j + \tilde\lambda_j \tilde Z_j^2),
\]

with $\tilde Z_j = (Z_j - \mu_j(\theta))/\sigma_j(\theta)$ standard normal under
$P_\theta$ and the new coefficients defined by matching terms. The cumulant generating
function of Q under $P_\theta$ now has the form in (9.5) but with the new coefficients.
A somewhat simpler derivation observes that the new cumulant generating
function (call it $\psi_\theta$) satisfies

\[
\psi_\theta(u) = \log E_\theta[e^{uQ}]
 = \log E\big[e^{uQ} e^{\theta Q - \psi(\theta)}\big]
 = \psi(\theta + u) - \psi(\theta).
\]

The characteristic function of Q under the new measure is thus given by

\[
\phi_\theta(u) = E_\theta\big[e^{\sqrt{-1}\,uQ}\big]
 = \exp\big(\psi(\theta + \sqrt{-1}\,u) - \psi(\theta)\big).
\]

Using this function in the inversion integral (9.6) yields the distribution

\[
F_\theta(x) = P_\theta(Q \le x).
\]

By computing (9.6) iteratively we can find points $a_k$ at which
$F_\theta(a_k) = p_1 + \cdots + p_k$ for desired stratum probabilities $p_j$ and then
set $A_k = (a_{k-1}, a_k)$. The endpoints of the intervals have zero probability
and may be omitted or included.


Conditional Sampling 


Given a mechanism for evaluating the distribution $F_\theta$ (in this case through
transform inversion), we could generate Q conditional on $Q \in A_k$ using the
inverse transform method as in Example 4.3.2. However, to use the stratified
decomposition (9.17), we need a mechanism to generate the pair (Q, L)
conditional on $Q \in A_k$ and this is less straightforward.

Whether or not we apply stratified sampling, we do not generate (Q, L)
directly. Rather, we generate a normal random vector Z and then evaluate
both Q and L as functions of Z; see the algorithm of Section 9.2.2. To sample
(Q, L) conditional on $Q \in A_k$, we therefore need to sample Z conditional on
$Q \in A_k$ and then evaluate L (and Q) from Z.

The analogous step in Section 4.6.2 required us to generate Z conditional
on $\mu^\top Z$, and this proved convenient because the conditional distribution is
itself normal. However, no similar simplification applies in generating Z given
Q; see the discussion of radial stratification in Section 4.3.2.

In the absence of a direct way of generating Z conditional on the value of
a quadratic function of Z, GHS [143] use a brute-force acceptance-rejection
method along the lines of Example 2.2.8. The method generates independent
replications of Z from $N(\mu(\theta), \Sigma(\theta))$, its unconditional distribution
under $P_\theta$, evaluates Q, and assigns Z to stratum k if $Q \in A_k$. If no more
samples for this stratum are needed, Z is simply discarded.

This is illustrated in Figure 9.3 for an example with
$Q = \lambda_1 Z_1^2 + \lambda_2 Z_2^2$ and positive $\lambda_1$, $\lambda_2$. The four
strata of the distribution of Q in the lower part of the figure define the four
elliptical strata for $(Z_1, Z_2)$ in the upper part of the figure. The target sample
size is eight. The labels on the points show the order in which they were generated
and the labels under the density show which samples have been accepted for which
strata. The points labeled 6, 9, and 10 have been rejected because their strata are
full; the second stratum needs one more observation, so more candidates need to
be generated.

Fig. 9.3. Illustration of stratified sampling by acceptance-rejection. The strata of
the density of Q define elliptical strata for $(Z_1, Z_2)$. The labels under the density
indicate the points accepted for each stratum in generating a sample of size eight.

To formulate this procedure more generally, suppose we have defined strata
$A_1,\ldots,A_K$ for Q and want to generate $n_k$ samples from stratum k,
$k = 1,\ldots,K$. Consider the following algorithm:

o initialize stratum counters $N_k \leftarrow 0$, $k = 1,\ldots,K$
o repeat
    - generate $Z \sim N(\mu(\theta), \Sigma(\theta))$
    - evaluate Q
    - find k such that $Q \in A_k$
    - if $N_k < n_k$
          $N_k \leftarrow N_k + 1$
          $Z_{kN_k} \leftarrow Z$
  until all $N_k = n_k$, $k = 1,\ldots,K$

For each stratum $k = 1,\ldots,K$, this algorithm produces samples $Z_{kj}$,
$j = 1,\ldots,n_k$, with the property that Q evaluated at $Z_{kj}$ is in $A_k$. Also,
each accepted $Z_{kj}$ has the conditional distribution of Z given $Q \in A_k$.
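The acceptance-rejection loop above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions that are not from the text: there is no importance sampling, Z is a standard normal pair, and $Q = Z_1^2 + Z_2^2$, so that Q is chi-square with two degrees of freedom and equiprobable strata have closed-form boundaries (in the setting of this section the boundaries would come from transform inversion instead). The function names are hypothetical.

```python
import math
import random


def equiprobable_strata(K):
    """Boundaries a_0 < ... < a_K of K equiprobable strata for Q ~ chi-square(2).

    Toy assumption: F(q) = 1 - exp(-q/2), so F^{-1}(u) = -2 log(1 - u).
    """
    return [-2.0 * math.log(1.0 - k / K) for k in range(K)] + [math.inf]


def stratum_index(q, bounds):
    """Return the 0-based k with bounds[k] <= q < bounds[k + 1]."""
    k = 0
    while q >= bounds[k + 1]:
        k += 1
    return k


def stratified_by_rejection(n_per_stratum, K, rng):
    """Generate n_per_stratum samples of Z in each stratum of Q by acceptance-rejection."""
    bounds = equiprobable_strata(K)
    accepted = [[] for _ in range(K)]   # accepted[k] plays the role of Z_{k1}, Z_{k2}, ...
    candidates = 0
    while any(len(s) < n_per_stratum for s in accepted):
        z = (rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))   # unconditional draw of Z
        candidates += 1
        q = z[0] ** 2 + z[1] ** 2                         # evaluate Q
        k = stratum_index(q, bounds)                      # find the stratum containing Q
        if len(accepted[k]) < n_per_stratum:              # keep Z only if stratum k is not full
            accepted[k].append(z)
    return accepted, bounds, candidates


if __name__ == "__main__":
    samples, bounds, candidates = stratified_by_rejection(20, 100, random.Random(0))
    print("candidates generated:", candidates)
```

The returned candidate count gives a direct view of the overhead of the method for a given number of strata and samples per stratum.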

This algorithm potentially generates many more candidate values of Z
than the total number required (which is $n_1 + \cdots + n_K$) and in this respect
may seem inefficient. But executing the steps in this algorithm can take far 
less time than revaluing the portfolio at each outcome of Z. In that case, 
expending the effort to generate stratified scenarios is justified. Appendix A 
of GHS [143] includes an analysis of the overhead due to this acceptance- 
rejection algorithm. It indicates, for example, that in generating 20 samples 
from each of 100 equiprobable strata, there is less than a 5% chance that the 
number of candidates generated will exceed 3800. 

Write $Q_{kj}$ and $L_{kj}$ for the values of Q and L computed from $Z_{kj}$,
$j = 1,\ldots,n_k$, $k = 1,\ldots,K$. Then the $(Q_{kj}, L_{kj})$ form a stratified
sample from the distribution of (Q, L) with stratification variable Q. These can
now be combined as in (4.32) to produce the estimator

\[
\sum_{k=1}^{K} P_\theta(Q \in A_k)\, \frac{1}{n_k} \sum_{j=1}^{n_k}
   e^{-\theta Q_{kj} + \psi(\theta)}\, \mathbf{1}\{L_{kj} > x\}
\]

of the loss probability P(L > x).
Numerical results reported in GHS [143] show that this combination of
importance sampling and stratification can substantially reduce variance,
especially at values of x for which P(L > x) is small. The results in [143] use
strata that are equiprobable under the $\theta$-twisted distribution, meaning that

\[
P_\theta(Q \in A_1) = \cdots = P_\theta(Q \in A_K) = \frac{1}{K},
\]

and a proportional allocation of samples to strata (in the sense of
Section 4.3.1). GHS [141] propose and test other allocation rules. They find that
some simple heuristics can further reduce variance by factors of 2 to 5.

If we use only the linear part of Q (a delta-only approximation) for impor- 
tance sampling and stratification, then Q and Z are jointly normal and the 
conditional distribution of Z given Q is again normal. In this case, the con- 
ditional sampling can be implemented more efficiently by sampling directly 
from this conditional normal distribution; see Section 4.3.2. 


Poststratification 


An alternative to using acceptance-rejection for stratification uses
unconditional samples of Z and weights the stratum averages by the stratum
probabilities. This is poststratification, as discussed in Section 4.3.3. Let
$Z_1,\ldots,Z_n$ be independent samples from $N(\mu(\theta), \Sigma(\theta))$. Let $N_k$
be the random number of these falling in stratum k, $k = 1,\ldots,K$, and label
these $Z_{kj}$, $j = 1,\ldots,N_k$. Let $(Q_{kj}, L_{kj})$ be the values computed from
$Z_{kj}$. The poststratified estimator is

\[
\sum_{k=1}^{K} P_\theta(Q \in A_k)\, \frac{1}{N_k} \sum_{j=1}^{N_k}
   e^{-\theta Q_{kj} + \psi(\theta)}\, \mathbf{1}\{L_{kj} > x\}.
\]

Take the kth term to be zero if $N_k = 0$.

As shown in Section 4.3.3, the poststratified estimator achieves the same
variance reduction as the genuinely stratified estimator in the limit as the
sample size increases. If L is much more time-consuming to evaluate than Q (as
would often be the case in practice), then stratification through
acceptance-rejection is preferable; otherwise, the poststratified estimator offers
a faster alternative.
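A minimal sketch of the poststratified estimator follows, again for a toy case that is an assumption of this illustration rather than an example from the text: no importance sampling (so the likelihood ratio is identically 1), $Q = Z_1^2 + Z_2^2$ with standard normal $Z_1, Z_2$, equiprobable strata from the chi-square(2) distribution of Q, and a hypothetical loss $L = Q + 0.5 Z_1$.

```python
import math
import random


def poststratified_loss_prob(x, n, K, rng):
    """Poststratified estimate of P(L > x) using K equiprobable strata of Q."""
    # Stratum boundaries for Q ~ chi-square(2): F^{-1}(u) = -2 log(1 - u).
    bounds = [-2.0 * math.log(1.0 - k / K) for k in range(K)] + [math.inf]
    counts = [0] * K   # N_k: number of unconditional samples landing in stratum k
    hits = [0] * K     # number of those samples with L > x
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        q = z1 * z1 + z2 * z2
        loss = q + 0.5 * z1            # hypothetical loss, close to but not equal to Q
        k = 0
        while q >= bounds[k + 1]:
            k += 1
        counts[k] += 1
        if loss > x:
            hits[k] += 1
    # Weight each stratum average by the known stratum probability 1/K;
    # a stratum with no samples contributes zero.
    return sum(hits[k] / counts[k] / K for k in range(K) if counts[k] > 0)


if __name__ == "__main__":
    print(poststratified_loss_prob(6.0, 20000, 20, random.Random(0)))
```

Every sample requires an evaluation of L here, which is why this variant is attractive mainly when L is cheap to compute.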
Numerical Examples


Table 9.1 compares variance reduction factors using the delta-gamma approx- 
imation as a control variate (CV), for importance sampling (IS), and for com- 
bined importance sampling and stratification (IS-S). These results are from 
GHS [142]; more extensive results are reported there and in [143]. 

The results in the table are for portfolios of options on 10 or 100 underlying
assets. The assets are modeled as geometric Brownian motion, and option
prices, deltas, and gammas use Black-Scholes formulas. Each underlying asset
has a volatility of 0.30 and the assets are assumed uncorrelated except in
the last case. We assume a risk-free interest rate of r = 5%; for purposes of
these examples, we do not distinguish between the real and risk-neutral rates
of return on the assets, using r for both. We use the following simple test
portfolios:


(A) Short positions in 10 calls and 5 puts on each of 10 underlying assets, all 
options expiring in 0.1 years. 
(B) Same as (A), but with the number of puts increased to produce a net
delta of zero.

(C) Short positions in 10 calls and 10 puts on each of 100 underlying assets, 
all options expiring in 0.1 years. 

(D) Same as (C), but with each pair of underlying assets having correlation 
0.2. 


We use portfolios of short positions so that large moves in the prices of
the underlying assets produce losses. In each case, we use a loss threshold x
resulting in a loss probability P(L > x) near 5%, 1%, or 0.5%. This threshold
is specified in the table through $x_{\mathrm{std}}$, the number of standard deviations
above the mean in the distribution of Q:

\[
x = E[Q] + x_{\mathrm{std}} \sqrt{\mathrm{Var}[Q]}
  = a + \sum_{j=1}^{m} \lambda_j
      + x_{\mathrm{std}} \sqrt{\sum_{j=1}^{m} (b_j^2 + 2\lambda_j^2)}.
\]

The results displayed in the last three columns of Table 9.1 are variance
reduction factors. Each is the ratio of variances using ordinary Monte Carlo
and using a variance reduction technique. The results indicate that the
delta-gamma control variate (column CV) yields modest variance reduction and that
its effectiveness generally decreases at smaller loss probabilities. Importance
sampling (IS) yields greater variance reduction, especially at the smallest loss
probabilities. The combination of importance sampling and stratification (IS-S)
yields very substantial variance reduction. The variance ratios in the table
are estimated using 120,000 replications for each method, with 40 strata and
3000 replications per stratum for the IS-S results.

                                    Variance Ratios
Portfolio   x_std   P(L > x)      CV      IS     IS-S

(A)          1.8      5.0%         5       7       30
             2.6      1.1%         3      22       70
             3.3      0.3%         2      27      173
(B)          1.9      4.7%         3       6       14
             2.8      1.1%         2      18       30
             3.2      0.5%         2      28       48
(C)          2.5      1.0%         3      27       45
(D)          2.5      1.0%         2      10       23

Table 9.1. Variance reduction factors in estimating loss probabilities using the
delta-gamma approximation as a control variate (CV), for importance sampling
(IS), and for importance sampling with stratification (IS-S).


More extensive results are reported in GHS [143]. Some of the test cases 
included there are specifically designed to challenge the methods. These in- 
clude portfolios of digital and barrier options combined in proportions that 
yield a net delta of zero. None of the variance reduction methods based on 
Q is effective in the most extreme cases, but the overall pattern is similar to 
the results in Table 9.1: importance sampling is most effective at small loss 
probabilities and stratification yields substantial additional variance
reduction. Similar observations apply in estimating a conditional excess loss; the
variance reduction achieved in estimating E[L | L > x] is usually about the
same as that achieved in estimating P(L > x).


9.3 A Heavy-Tailed Setting 


9.3.1 Modeling Heavy Tails 


We noted in Section 9.1.1 that the normal distribution has shortcomings as a 
model of changes in market prices: in virtually all markets, the distribution of 
observed price changes displays a higher peak and heavier tails than can be 
captured with a normal distribution. This is especially true over short time 
horizons; see, for example, the daily return statistics in Appendix F of Duffie 
and Pan [103]. High peaks and heavy tails are characteristic of a market with 
small price changes in most periods accompanied by occasional very large 
price changes. 

While other theoretical distributions may do a better job of fitting market 
data, the normal distribution offers many convenient properties for modeling 
and computation. Choosing a description of market data thus entails a
compromise between realism and tractability. This section shows how some of the
methods of Section 9.2 can be extended beyond the normal distribution to
capture features of market data.

The qualitative property of having a high peak and heavy tails is often
measured through kurtosis. The kurtosis of a random variable X with mean
$\mu$ is given by

\[
\frac{E[(X - \mu)^4]}{\big(E[(X - \mu)^2]\big)^2},
\]

assuming X has a finite fourth moment. Every normal random variable has
a kurtosis of 3; distributions are sometimes compared on the basis of excess
kurtosis, the difference between the kurtosis and 3. A sample kurtosis can
be calculated from data by replacing the expectations in the definition with
sample averages.

Kurtosis normalizes the fourth central moment of a distribution by the
square of its variance. If two distributions have the same standard deviation,
the one with higher kurtosis will ordinarily have a higher peak and heavier
tails. Such a distribution is called leptokurtotic.

The simplest extension of the normal distribution exhibiting higher
kurtosis is a mixture of two normals. Consider a mixture

\[
q\,N(0, \sigma_1^2) + (1 - q)\,N(0, \sigma_2^2)
\]

with $q \in (0,1)$. By this we mean the distribution of a random variable drawn
from $N(0, \sigma_1^2)$ with probability q and drawn from $N(0, \sigma_2^2)$ with
probability $1 - q$. Its variance is

\[
\sigma^2 = q\sigma_1^2 + (1 - q)\sigma_2^2,
\]

and its kurtosis is

\[
\frac{3\big(q\sigma_1^4 + (1 - q)\sigma_2^4\big)}{\big(q\sigma_1^2 + (1 - q)\sigma_2^2\big)^2}.
\]

Whereas the normal distribution $N(0, \sigma^2)$ with the same variance has kurtosis
3, the mixture can achieve arbitrarily high kurtosis if we let $\sigma_1$ increase and
let q approach zero while keeping the overall variance $\sigma^2$ unchanged.
Figure 9.4 compares the case $\sigma_1 = 1.4$, $\sigma_2 = 0.6$, and $q = 0.4$ with the
standard normal density, both having a standard deviation of 1.

This mechanism describes a market in which a fraction q of periods (e.g.,
days) have high variability and a fraction $1 - q$ have low variability. It can
be extended to a multivariate model by mixing multivariate normals
$N(0, \Sigma_1)$ and $N(0, \Sigma_2)$. All of the variance reduction techniques discussed
in Section 9.2 extend in a straightforward way to this class of models. The
techniques apply to samples from each $N(0, \Sigma_i)$, and results from the two
distributions can be combined using the weights q and $1 - q$.
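As a quick numerical check of the variance and kurtosis formulas, the following Python sketch evaluates them for the parameters used in Figure 9.4. The function name is illustrative only.

```python
def mixture_variance_kurtosis(q, sigma1, sigma2):
    """Variance and kurtosis of the mixture q N(0, sigma1^2) + (1 - q) N(0, sigma2^2)."""
    variance = q * sigma1 ** 2 + (1.0 - q) * sigma2 ** 2
    fourth_moment = 3.0 * (q * sigma1 ** 4 + (1.0 - q) * sigma2 ** 4)
    return variance, fourth_moment / variance ** 2


if __name__ == "__main__":
    var, kurt = mixture_variance_kurtosis(0.4, 1.4, 0.6)
    # The variance is 1 (matching the standard normal) and the kurtosis exceeds 3.
    print(var, kurt)
```

For these parameters the variance is 0.4(1.96) + 0.6(0.36) = 1 and the kurtosis is 3(0.4(3.8416) + 0.6(0.1296)) = 4.8432, compared with 3 for the normal distribution.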


Heavy Tails 


Kurtosis provides some information about the tails of a distribution, but it is
far from a complete measure of the heaviness of the tails. Further information
is provided by the rate of decay of a probability density or, equivalently, the
number of finite moments or exponential moments.

Fig. 9.4. Solid line is the mixed normal density with $\sigma_1 = 1.4$,
$\sigma_2 = 0.6$, and $q = 0.4$. Dashed line is the standard normal density.

To focus on just one tail, consider the case of a nonnegative random
variable X, which could be the positive part of a random variable taking both
negative and positive values. The following conditions define three important
categories of distributions:


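(i) $E[\exp(\theta X)] < \infty$ for all $\theta \in \mathbb{R}$;

(ii) $E[\exp(\theta X)] < \infty$ for all $\theta < \theta^*$ and
$E[\exp(\theta X)] = \infty$ for all $\theta > \theta^*$, for some
$\theta^* \in (0, \infty)$;

(iii) $E[X^r] < \infty$ for all $r < \nu$ and $E[X^r] = \infty$ for all $r > \nu$,
for some $\nu \in (0, \infty)$.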


The first category includes all normal random variables (or their positive
parts) and all bounded random variables. The second category describes
distributions with exponential tails and includes all gamma distributions and in
particular the exponential density $\theta^* \exp(-\theta^* x)$. The third category
(for which $E[\exp(\theta X)] = \infty$ for all $\theta > 0$) describes heavy-tailed
distributions. This category includes the stable Paretian distributions discussed
in Section 3.5.2, for which $\nu < 2$. This category includes distributions whose
tails decay like $x^{-\nu}$ and, more generally, regularly varying tails, as defined
in, e.g., Embrechts et al. [111]. These three categories are not exhaustive; the
lognormal, for example, fits just between the second and third categories having
$\theta^* = 0$ and $\nu = \infty$.

Empirical data is necessarily finite, making it impossible to draw definite
conclusions about the extremes of a distribution. Many studies have found
that the third category above provides the best description of market data,
with $\nu$ somewhere in the range of 3-7, depending on the market and the time
horizon over which returns are measured. Although a mixture of two normal
distributions can produce an arbitrarily large kurtosis, it lies within the first
category above (because the tail is ultimately determined by the larger of
the two standard deviations), so it does not provide an entirely satisfactory
model.


Student t Distribution 


An alternative extension of the normal distribution that provides genuinely
heavy tails is the Student t distribution with density

\[
f_\nu(x) = \frac{\Gamma((\nu+1)/2)}{\sqrt{\nu\pi}\,\Gamma(\nu/2)}
           \left(1 + \frac{x^2}{\nu}\right)^{-(\nu+1)/2}, \qquad -\infty < x < \infty,
\]

$\Gamma(\cdot)$ here denoting the gamma function. The parameter $\nu$ (the degrees of
freedom) controls the heaviness of the tails. If X has this $t_\nu$ density, then

\[
P(X > x) \sim \text{constant} \times x^{-\nu}
\]

as $x \to \infty$, and $\nu$ determines the number of finite moments of |X| in the
sense of category (iii) above. If $\nu > 2$, then X has variance $\nu/(\nu - 2)$. The
standard normal density is the limit of $f_\nu$ as $\nu \to \infty$.

Figure 9.5 compares the ts density (solid line) with a normal density 
(dashed line) scaled to have the same variance. The higher peak of the t 
distribution is evident from the left panel of the figure, which plots the den- 
sities. The right panel shows the logarithms of the densities from which the 
heavier tails of the t distribution are evident. 




Fig. 9.5. Comparison of ts (solid) and normal distribution (dashed) with the same 
variance. Left panel shows the densities, right panel shows the log densities. 


A $t_\nu$ random variable can be represented as a ratio $Z/\sqrt{Y/\nu}$ in which Z
has the standard normal distribution and Y has the chi-square distribution
$\chi^2_\nu$ and is independent of Z. This representation shows that $t_\nu$ is a
mixture of normals, but rather than mixing just two normals, the $t_\nu$ mixes
infinitely many. If we interpret a mixture of two normals as a normal distribution
with random variance taking values $\sigma_1^2$ and $\sigma_2^2$, then the $t_\nu$ can be
thought of as a normal with random variance equal to $\nu/Y$. This suggests a
mechanism for generating an even richer class of distributions by replacing
$\nu/Y$ with other random variables.
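The ratio representation translates directly into a sampling routine. The Python sketch below is illustrative; it uses the fact that a chi-square variable with $\nu$ degrees of freedom is a gamma variable with shape $\nu/2$ and scale 2, and the choice $\nu = 8$ in the usage lines is an arbitrary assumption made so that the sample variance is a stable estimate.

```python
import math
import random


def sample_t(nu, rng):
    """Draw one t_nu variate as Z / sqrt(Y / nu), with Z ~ N(0,1) and Y ~ chi-square(nu)."""
    z = rng.gauss(0.0, 1.0)
    y = rng.gammavariate(nu / 2.0, 2.0)   # chi-square(nu) = gamma(shape nu/2, scale 2)
    return z / math.sqrt(y / nu)


def sample_variance(xs):
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)


if __name__ == "__main__":
    rng = random.Random(1)
    nu = 8.0
    xs = [sample_t(nu, rng) for _ in range(200000)]
    # The sample variance should be close to nu / (nu - 2).
    print(sample_variance(xs), nu / (nu - 2.0))
```

Replacing the gamma draw with another positive random variable gives the richer class of normal mixtures mentioned above.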


Multivariate t 


Measuring the risk in a portfolio requires a multivariate model of changes in
market prices, so we need a multivariate model with heavy-tailed marginals.
A t density in $\mathbb{R}^m$ (as defined in Anderson [17]) is given by

\[
f_{\nu,\Sigma}(x) = \frac{\Gamma((m+\nu)/2)}{(\nu\pi)^{m/2}\,\Gamma(\nu/2)\,|\Sigma|^{1/2}}
   \left(1 + \frac{1}{\nu}\, x^\top \Sigma^{-1} x\right)^{-(m+\nu)/2},
   \qquad x \in \mathbb{R}^m.
\tag{9.19}
\]

Here, $\Sigma$ is a symmetric, positive definite matrix, $|\Sigma|$ is its determinant,
and $\nu$ is again the degrees of freedom parameter. If $\nu > 2$, then the distribution
has covariance matrix $\nu\Sigma/(\nu - 2)$. If all diagonal entries of $\Sigma$ are equal
to 1, then $\Sigma$ is the correlation matrix of the distribution (assuming $\nu > 2$),
and each marginal is a univariate $t_\nu$ distribution. Without this restriction on
the diagonal of $\Sigma$, each coordinate has the distribution of a scaled $t_\nu$ random
variable. In the limit as $\nu \to \infty$, (9.19) becomes the density of the
multivariate normal distribution $N(0, \Sigma)$.

If $(X_1,\ldots,X_m)$ have (9.19) as their joint density, then they admit the
representation

\[
(X_1,\ldots,X_m) =_d \frac{(\xi_1,\ldots,\xi_m)}{\sqrt{Y/\nu}},
\tag{9.20}
\]

where $=_d$ denotes equality in distribution, $\xi = (\xi_1,\ldots,\xi_m)$ has
distribution $N(0, \Sigma)$, and Y has distribution $\chi^2_\nu$ independent of $\xi$. A
multivariate t random vector is therefore a multivariate normal vector with a
randomly scaled covariance matrix. Also, it follows from (9.19) that the vector X
with density $f_{\nu,\Sigma}$ can be represented as

\[
X = \frac{AZ}{\sqrt{Y/\nu}} = A\tilde X,
\tag{9.21}
\]

where $AA^\top = \Sigma$, $Z \sim N(0, I)$, and $\tilde X$ is a multivariate t random
vector with density $f_{\nu,I}$. The components of $\tilde X$ are uncorrelated (their
correlation matrix is the identity), but not independent. Dependence is introduced
by the shared denominator.

The representation (9.20) and the factorization in (9.21) make (9.19) a
particularly convenient multivariate distribution and provide a mechanism for
simulation. Much of what we discuss in this section extends to other random
vectors with the representation (9.20) for some other choice of denominator.

A shortcoming of (9.19) is that it requires all marginals to have the same
parameter $\nu$ and thus the same degree of heaviness in their tails. Because
empirical evidence suggests that for most markets $\nu$ should be in the range of
3-7, this is not a fatal flaw. It does, however, suggest considering more general
multivariate t distributions.
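The effect of the shared denominator in (9.20) and (9.21) can be seen in a small simulation. The Python sketch below is an illustration under assumed settings ($\Sigma = I$, $m = 2$, $\nu = 8$, none of which come from the text): it samples a multivariate t vector with identity $\Sigma$ and compares the correlation of the components with the correlation of their absolute values. The function names are hypothetical.

```python
import math
import random


def sample_spherical_t(m, nu, rng):
    """Draw an m-vector with density f_{nu,I}: independent normals over a shared sqrt(Y/nu)."""
    y = rng.gammavariate(nu / 2.0, 2.0)      # Y ~ chi-square(nu)
    scale = math.sqrt(y / nu)                # common denominator for all components
    return [rng.gauss(0.0, 1.0) / scale for _ in range(m)]


def correlation(u, v):
    """Sample correlation of two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v)) / n
    var_u = sum((a - mu) ** 2 for a in u) / n
    var_v = sum((b - mv) ** 2 for b in v) / n
    return cov / math.sqrt(var_u * var_v)


if __name__ == "__main__":
    rng = random.Random(3)
    draws = [sample_spherical_t(2, 8.0, rng) for _ in range(200000)]
    x1 = [d[0] for d in draws]
    x2 = [d[1] for d in draws]
    print("corr(X1, X2)     =", correlation(x1, x2))
    print("corr(|X1|, |X2|) =", correlation([abs(a) for a in x1], [abs(b) for b in x2]))
```

The first correlation should be near zero while the second is clearly positive, which is the sense in which the components are uncorrelated but not independent. A general $\Sigma$ would be handled by multiplying the normal vector by a factor A with $AA^\top = \Sigma$ before dividing.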


The t-Copula 


One mechanism for allowing different coordinates to have different parameters
starts with a multivariate t vector of the type in (9.20) and then modifies the
marginals. Let $F_\nu$ denote the cumulative distribution function of the univariate
$t_\nu$ distribution. Let the vector X have the representation in (9.20) with
$\Sigma$ having all diagonal entries equal to 1. This implies that $X_i \sim t_\nu$ and
then that $F_\nu(X_i)$ is uniformly distributed on the unit interval. Just as in the
inverse transform method (Section 2.2.1), applying an inverse distribution
$F_{\nu_i}^{-1}$ gives $F_{\nu_i}^{-1}(F_\nu(X_i))$ the t distribution with $\nu_i$ degrees of
freedom. Applying such a transformation to each coordinate produces a vector

\[
(\tilde X_1,\ldots,\tilde X_m)
 = \big(F_{\nu_1}^{-1}(F_\nu(X_1)),\ldots,F_{\nu_m}^{-1}(F_\nu(X_m))\big),
\tag{9.22}
\]

the components of which have t distributions with arbitrary parameters
$\nu_1,\ldots,\nu_m$.

This is a special case of a more general mechanism (to which we return in
Section 9.4.2) for constructing multivariate distributions through "copulas,"
in this case a t-copula. The transformation in (9.22) does not preserve
correlations, making it difficult to give a concise description of the dependence
structure of $(\tilde X_1,\ldots,\tilde X_m)$; it does, however, preserve Spearman rank
correlations. This and related properties are discussed in Embrechts et al. [112].

To simulate changes $\Delta S_i$ in market prices using (9.22), we would set

\[
\Delta S_i = \tilde\sigma_i \sqrt{\frac{\nu_i - 2}{\nu_i}}\; \tilde X_i,
\tag{9.23}
\]

assuming $\nu_i > 2$. This makes $\tilde\sigma_i^2$ the variance of $\Delta S_i$ and gives
$\Delta S_i$ a scaled $t_{\nu_i}$ distribution. The specification of the model would run in
the opposite direction: using market data, we would first estimate $\tilde\sigma_i$ and
$\nu_i$ for each marginal $\Delta S_i$, $i = 1,\ldots,m$; we would then apply the
transformation $X_i = F_\nu^{-1}(F_{\nu_i}(\Delta S_i))$ to each coordinate of the data, and
finally estimate the correlation matrix $\Sigma$ of $(X_1,\ldots,X_m)$. A similar
procedure is developed and tested in Hosking, Bonti, and Siegel [188] based on the
normal distribution (i.e., with $\nu = \infty$); GHS [144] recommend using a value of
$\nu$ closer to $\nu_1,\ldots,\nu_m$. Estimation of $\nu_i$ is discussed in Hosking et
al. [188] and Johnson, Kotz, and Balakrishnan [202].

9.3.2 Delta-Gamma Approximation


The variance reduction techniques in Section 9.2 rely on the tractability of 
the delta-gamma approximation (9.2) when $\Delta S$ has a multivariate normal
distribution. To extend these techniques to the multivariate t, we therefore 
need to extend the analysis of the delta-gamma approximation. This analysis, 
carried out in GHS [144], is of independent interest because it permits use 
of the approximation as a rough but fast alternative to simulation. The ex- 
tension applies to a large class of distributions admitting the representation 
(9.20), in addition to the multivariate t. We comment on the application to 
the transformed multivariate t in (9.22) after considering the simpler case of 
(9.19). 

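Suppose, then, that $\Delta S$ has the multivariate $t_\nu$ density in (9.19) with
$\Sigma = \Sigma_S$. By following exactly the steps leading to (9.4), we arrive at the
representation

\[
Q = a + \sum_{j=1}^{m} (b_j X_j + \lambda_j X_j^2)
\tag{9.24}
\]

for the quadratic approximation to the loss L, with

\[
CX = \Delta S, \qquad X = (X_1,\ldots,X_m)^\top,
\]
\[
b = -C^\top \delta, \qquad b = (b_1,\ldots,b_m)^\top,
\]
\[
-\tfrac{1}{2} C^\top \Gamma C = \mathrm{diag}(\lambda_1,\ldots,\lambda_m),
\]

where $\lambda_1,\ldots,\lambda_m$ are the eigenvalues of $-\Sigma_S\Gamma/2$ and X has the
multivariate $t_\nu$ density (9.19) with $\Sigma = I$ the identity matrix. The constant a
in (9.24) is $-(\Delta t)\,\partial V/\partial t$, just as in (9.3).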

At this point, the analyses of the normal and t distributions diverge.
Uncorrelated jointly normal random variables are independent of each other, so
(9.4) expresses Q as a sum of independent random variables. But as explained
following (9.21), the components $X_1,\ldots,X_m$ of the multivariate t random
vector are not independent even if they are uncorrelated. The independence
of the summands in (9.4) allowed us to derive the moment generating function
(and then characteristic function) of Q as the product of the moment generating
functions of the summands. No such factorization applies to (9.24) because
of the loss of independence. Moreover, in the heavy-tailed setting, the $X_j$ and
Q do not have moment generating functions; they belong to category (iii)
of Section 9.3.1. This further complicates the derivation of the characteristic
function of Q.

Indirect Delta-Gamma


These obstacles can be circumvented through an indirect delta-gamma
approximation. Use (9.20) to write

\[
Q = a + \sum_{j=1}^{m} \left( b_j \frac{Z_j}{\sqrt{Y/\nu}} + \lambda_j \frac{Z_j^2}{Y/\nu} \right)
\]

with $Z_1,\ldots,Z_m$ independent N(0, 1) random variables. For fixed
$x \in \mathbb{R}$, define

\[
Q_x = (Y/\nu)(Q - x)
    = (a - x)(Y/\nu) + \sum_{j=1}^{m} \left( b_j \sqrt{Y/\nu}\, Z_j + \lambda_j Z_j^2 \right)
\tag{9.25}
\]

and observe that

\[
P(Q \le x) = P(Q_x \le 0).
\tag{9.26}
\]

Thus, we can evaluate the distribution of Q at x indirectly by evaluating that
of $Q_x$ at zero. The random variable $Q_x$ turns out to be much more convenient
to work with.
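The identity behind (9.25) and (9.26) holds sample by sample, since $Y/\nu$ is positive. The Python sketch below checks this numerically; the coefficients a, $b_j$, $\lambda_j$, the value of $\nu$ and the threshold x are arbitrary illustrative choices, and the function names are hypothetical.

```python
import math
import random


def q_and_qx(a, b, lam, nu, x, y, z):
    """Return (Q, Q_x) computed from the same draw of Y and Z_1, ..., Z_m."""
    w = y / nu                                   # the positive factor Y / nu
    q = a + sum(bj * zj / math.sqrt(w) + lj * zj * zj / w
                for bj, lj, zj in zip(b, lam, z))
    qx = (a - x) * w + sum(bj * math.sqrt(w) * zj + lj * zj * zj
                           for bj, lj, zj in zip(b, lam, z))
    return q, qx


def max_identity_error(a, b, lam, nu, x, n, rng):
    """Largest observed |Q_x - (Y/nu)(Q - x)| over n random draws."""
    worst = 0.0
    for _ in range(n):
        y = rng.gammavariate(nu / 2.0, 2.0)      # Y ~ chi-square(nu)
        z = [rng.gauss(0.0, 1.0) for _ in b]
        q, qx = q_and_qx(a, b, lam, nu, x, y, z)
        worst = max(worst, abs(qx - (y / nu) * (q - x)))
    return worst


if __name__ == "__main__":
    err = max_identity_error(0.0, [0.3, -0.2], [0.1, -0.05], 5.0, 1.0, 10000, random.Random(4))
    print("largest discrepancy:", err)
```

Because the two quantities differ only by a positive factor, the events {Q <= x} and {Q_x <= 0} coincide, which is (9.26).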

Conditional on Y, $Q_x$ is a quadratic function of independent standard normal
random variables. We can therefore apply (9.5) to get (as in Theorem 3.1
of GHS [144])

\[
E\big[e^{\theta Q_x} \,\big|\, Y\big]
 = \exp\left( (a - x)\theta Y/\nu
     + \frac{1}{2} \sum_{j=1}^{m} \frac{\theta^2 b_j^2\, Y/\nu}{1 - 2\theta\lambda_j} \right)
   \prod_{j=1}^{m} \frac{1}{\sqrt{1 - 2\theta\lambda_j}},
\tag{9.27}
\]

provided $2 \max_j \theta\lambda_j < 1$. The coefficient of Y in this expression is

\[
\alpha(\theta) = (a - x)\theta/\nu
   + \frac{1}{2} \sum_{j=1}^{m} \frac{\theta^2 b_j^2/\nu}{1 - 2\theta\lambda_j}.
\tag{9.28}
\]

Suppose the moment generating function of Y is finite at $\alpha(\theta)$:

\[
\phi_Y(\alpha(\theta)) = E\big[e^{\alpha(\theta) Y}\big] < \infty.
\]

Then (9.27) yields

\[
\phi_x(\theta) = E\big[e^{\theta Q_x}\big]
 = \phi_Y(\alpha(\theta)) \prod_{j=1}^{m} \frac{1}{\sqrt{1 - 2\theta\lambda_j}}.
\tag{9.29}
\]

This is the moment generating function of $Q_x$. Replacing the real argument
$\theta$ with a purely imaginary argument $\sqrt{-1}\,u$, $u \in \mathbb{R}$, yields the
value of the characteristic function at u. This characteristic function can be
inserted in the inversion integral (9.6) to compute $P(Q_x \le 0)$ and thus
$P(Q \le x)$.

In the specific case that X has a multivariate $t_\nu$ distribution, Y is
$\chi^2_\nu$ and

\[
\phi_Y(\theta) = (1 - 2\theta)^{-\nu/2}, \qquad \theta < 1/2,
\]

which completes the calculation of (9.29). However, this derivation shows that
$Q_x$ has a tractable moment generating function and characteristic function in
greater generality.
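As a sanity check on (9.28) and (9.29) in the chi-square case, the following Python sketch evaluates the closed-form moment generating function of $Q_x$ and compares it with a plain Monte Carlo average of $e^{\theta Q_x}$. The numerical inputs are arbitrary illustrative assumptions, and the function names are hypothetical.

```python
import math
import random


def alpha(theta, a, b, lam, nu, x):
    """Coefficient of Y in (9.27), as given in (9.28)."""
    return (a - x) * theta / nu + 0.5 * sum(
        theta ** 2 * bj ** 2 / (nu * (1.0 - 2.0 * theta * lj)) for bj, lj in zip(b, lam))


def mgf_qx(theta, a, b, lam, nu, x):
    """Moment generating function (9.29) of Q_x when Y ~ chi-square(nu)."""
    if any(2.0 * theta * lj >= 1.0 for lj in lam):
        raise ValueError("need 2 * theta * lambda_j < 1 for all j")
    al = alpha(theta, a, b, lam, nu, x)
    if al >= 0.5:
        raise ValueError("chi-square mgf is infinite at alpha(theta) >= 1/2")
    phi_y = (1.0 - 2.0 * al) ** (-nu / 2.0)
    prod = 1.0
    for lj in lam:
        prod /= math.sqrt(1.0 - 2.0 * theta * lj)
    return phi_y * prod


def mc_mgf_qx(theta, a, b, lam, nu, x, n, rng):
    """Monte Carlo estimate of E[exp(theta * Q_x)] under the original measure."""
    total = 0.0
    for _ in range(n):
        w = rng.gammavariate(nu / 2.0, 2.0) / nu          # Y / nu
        qx = (a - x) * w
        for bj, lj in zip(b, lam):
            z = rng.gauss(0.0, 1.0)
            qx += bj * math.sqrt(w) * z + lj * z * z
        total += math.exp(theta * qx)
    return total / n


if __name__ == "__main__":
    args = (0.2, 0.0, [0.3, -0.2], [0.1, -0.05], 5.0, 1.0)
    print(mgf_qx(*args), mc_mgf_qx(*args, 200000, random.Random(2)))
```

The two printed numbers should agree to within Monte Carlo error for any $\theta$ at which $e^{\theta Q_x}$ has finite variance.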

Calculation of the distribution of Q through inversion of the characteristic
function of $Q_x$ is not quite as efficient as the approach in the multivariate
normal setting of Section 9.1.2 based on direct inversion of the characteristic
function of Q. Using the Fast Fourier Transform (as discussed in, e.g., Press et
al. [299]), the distribution of Q can be evaluated at N points using N terms in
the inversion integral in O(N log N) time, given the characteristic function of
Q. In the indirect approach based on (9.26), evaluating the distribution of Q
at N points requires N separate inversion integrals, because $Q_x$ depends on x.
With N terms in each inversion integral, the total effort becomes $O(N^2)$. This
is not a significant issue in our application of the approximation to variance
reduction because the number of points at which we need the distribution of
Q is small, often just one.

The derivation of (9.29) relies on the representation (9.20) of the
multivariate t distribution (9.19) with common parameter $\nu$ for all marginals.
To extend it to the more general case of (9.22) with arbitrary parameters
$\nu_1,\ldots,\nu_m$, we make a linear approximation to the transformation in (9.22).
This converts a quadratic approximation in the $\tilde X_i$ variables into a quadratic
approximation in the $X_i$ variables.

In more detail, write $\Delta S_i = K_i(X_i)$, $i = 1,\ldots,m$, for the mapping from
$X_1,\ldots,X_m$ to $\Delta S$ defined by (9.22) and (9.23). The deltas and gammas of
the portfolio value V with respect to the $X_i$ are given by

\[
\frac{\partial V}{\partial X_i} = \delta_i \frac{\partial K_i}{\partial X_i},
\]
\[
\frac{\partial^2 V}{\partial X_i \partial X_j}
  = \Gamma_{ij} \frac{\partial K_i}{\partial X_i} \frac{\partial K_j}{\partial X_j},
  \quad i \ne j,
\qquad
\frac{\partial^2 V}{\partial X_i^2}
  = \Gamma_{ii} \left(\frac{\partial K_i}{\partial X_i}\right)^2
    + \delta_i \frac{\partial^2 K_i}{\partial X_i^2},
\]

with all derivatives of K evaluated at 0. Using these derivatives, we arrive
at an approximation to the change in portfolio value (as in (9.2)) that is
quadratic in $X_1,\ldots,X_m$ and to which the indirect approach through $Q_x$ and
(9.29) then applies.


9.3.3 Variance Reduction 


We turn now to the use of the delta-gamma approximation for variance
reduction in the heavy-tailed setting. With normally distributed $\Delta S$, we applied
importance sampling by defining an exponential change of measure through
Q in (9.10). A direct application of this idea in the heavy-tailed setting is
impossible, because Q no longer has a moment generating function (it belongs to
category (iii) of Section 9.3.1), so there is no way to normalize the exponential
twisting factor $\exp(\theta Q)$ to produce a probability distribution for positive
values of $\theta$. Indeed, importance sampling with heavy-tailed distributions is
a notoriously difficult problem. Asmussen et al. [23] demonstrate the failure
of several plausible importance sampling estimators in heavy-tailed problems.
Beyond its relevance to risk management, the example we treat here is of
interest because it is one of few examples of effective importance sampling with
heavy-tailed distributions.


Importance Sampling 


We circumvent the difficulty of working with Q by exponentially twisting
the indirect approximation $Q_x$ instead. We derived the moment generating
function of $Q_x$ in (9.29). Let $\psi_x(\theta)$ denote its cumulant generating
function, the logarithm of (9.29). Define an exponential family of probability
measures $P_\theta$ through the likelihood ratio

\[
\frac{dP_\theta}{dP} = e^{\theta Q_x - \psi_x(\theta)}
\tag{9.30}
\]

for all $\theta$ at which $\psi_x(\theta) < \infty$. This allows us to write loss
probabilities as

\[
P(L > y) = E_\theta\big[e^{-\theta Q_x + \psi_x(\theta)}\, \mathbf{1}\{L > y\}\big]
\]

and this yields the importance sampling estimator

\[
e^{-\theta Q_x + \psi_x(\theta)}\, \mathbf{1}\{L > y\}
\]

with $(Q_x, L)$ sampled from their joint distribution under $P_\theta$.

Before proceeding with the mechanics of this estimator, we should point
out what makes this approach appealing. We are free to choose x, so suppose
we have y = x. The second moment of the estimator is then given by

\[
E_\theta\left[\left(\frac{dP}{dP_\theta}\right)^2 \mathbf{1}\{L > x\}\right]
 = E\left[\frac{dP}{dP_\theta}\, \mathbf{1}\{L > x\}\right]
 = E\big[e^{-\theta Q_x + \psi_x(\theta)}\, \mathbf{1}\{L > x\}\big],
\]

which shows that to reduce variance we need the likelihood ratio to be small
on the event {L > x}. If Q provides a reasonable approximation to L, then
when L > x we often have Q > x and thus $Q_x > 0$, and this tends to produce
a small likelihood ratio if $\theta$ is positive.

This tendency must be balanced against the magnitude of $\psi_x$. We strike a
balance through the choice of parameter $\theta$. Choosing $\theta$ equal to $\theta_x$,
the root of the equation

\[
\psi_x'(\theta_x) = 0,
\]

minimizes the convex function $\psi_x$. Finding $\theta_x$ is a one-dimensional
minimization problem and can be solved very quickly. This choice of $\theta$ imposes
the condition (see (9.16) and the surrounding discussion)

\[
E_{\theta_x}[Q_x] = 0
\]

and thus centers the distribution of $Q_x$ near 0. This centers the distribution
of Q near x which in turn makes the event {L > x} less rare.
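The one-dimensional minimization of $\psi_x$ can be sketched as follows for the chi-square case, using the logarithm of (9.29). This is an illustrative implementation, not the procedure of GHS [144]: it uses a simple ternary search, assumes that the largest $\lambda_j$ is positive and that the minimizer is positive (which holds when $E[Q_x] = a - x + \sum_j \lambda_j < 0$), and uses made-up coefficients in the usage lines.

```python
import math


def psi_x(theta, a, b, lam, nu, x):
    """Cumulant generating function of Q_x for Y ~ chi-square(nu); +inf outside its domain."""
    if any(2.0 * theta * lj >= 1.0 for lj in lam):
        return math.inf
    al = (a - x) * theta / nu + 0.5 * sum(
        theta ** 2 * bj ** 2 / (nu * (1.0 - 2.0 * theta * lj)) for bj, lj in zip(b, lam))
    if al >= 0.5:
        return math.inf
    return (-0.5 * nu * math.log(1.0 - 2.0 * al)
            - 0.5 * sum(math.log(1.0 - 2.0 * theta * lj) for lj in lam))


def find_theta_x(a, b, lam, nu, x, iterations=200):
    """Minimize the convex function psi_x over (0, 1 / (2 max lambda_j)) by ternary search."""
    lam_max = max(lam)
    if lam_max <= 0.0:
        raise ValueError("this sketch assumes at least one positive lambda_j")
    lo, hi = 0.0, 1.0 / (2.0 * lam_max)
    for _ in range(iterations):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        # Using <= also handles the case where both points lie outside the domain (+inf).
        if psi_x(m1, a, b, lam, nu, x) <= psi_x(m2, a, b, lam, nu, x):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    theta_x = find_theta_x(0.0, [0.3, -0.2], [0.5, 0.2], 5.0, 3.0)
    print("theta_x =", theta_x)
```

A root finder applied to $\psi_x'$ would serve equally well; the search above only exploits convexity.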

To achieve these properties, we need to be able to sample the pair $(Q_x, L)$
from their distribution under $P_\theta$. As in Section 9.2.2, we do not sample these
directly; rather, we sample $Y$ and $Z_1, \ldots, Z_m$. From these variables we
compute $Q_x$ and $X_1, \ldots, X_m$, and from the $X_i$ we compute $L$. Thus, it suffices to
find the distribution of $(Y, Z_1, \ldots, Z_m)$ under $P_\theta$.

This problem is solved by Theorem 4.1 of GHS [144] and the answer is in 
effect embedded in the derivation of the moment generating function (9.29). 
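Let $f_Y$ denote the density of $Y$ under the original probability measure $P = P_0$;
this is the $\chi^2_\nu$ density in the case of the multivariate $t$ model. Let $\alpha(\theta)$ be as
in (9.28). Then under $P_\theta$, $Y$ has density

$$f_\theta(y) = e^{\alpha(\theta) y} f_Y(y) / \phi_Y(\alpha(\theta)), \tag{9.31}$$

with $\phi_Y$ the moment generating function of $f_Y$. Conditional on $Y$, the
components of $(Z_1, \ldots, Z_m)$ are independent normal random variables
$Z_j \sim N(\mu_j(\theta), \sigma_j^2(\theta))$, with

$$\mu_j(\theta) = \frac{\theta b_j \sqrt{Y/\nu}}{1 - 2\theta\lambda_j}, \qquad
\sigma_j^2(\theta) = \frac{1}{1 - 2\theta\lambda_j}. \tag{9.32}$$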


This is verified by writing the likelihood ratio for the change of distribution 
in (9.31) and multiplying it by the likelihood ratio corresponding to (9.32) 
given Y. (The second step coincides with the calculation in Section 9.2.2.) 
The product of these factors simplifies to (9.30), as detailed in GHS [144]. 
Notice that (9.32) also follows from (9.25) and (9.12) once we condition on Y. 

This result says that the change of measure defined by exponentially twisting
$Q_x$ can be implemented by exponentially twisting $Y$ and then changing
the means and variances of the $Z_j$ given $Y$. We have circumvented the
difficulty of applying importance sampling directly to the $t$ random variables $X_i$
by instead applying exponential twisting to the denominator and numerator
random variables.

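In the specific case that $Y \sim \chi^2_\nu$, the change of measure (9.31) gives $Y$
a gamma distribution with shape parameter $\nu/2$ and scale parameter
$2/(1 - 2\alpha(\theta))$; i.e.,

$$f_\theta(y) = \frac{1}{\Gamma(\nu/2)} \left(\frac{2}{1 - 2\alpha(\theta)}\right)^{-\nu/2}
y^{(\nu - 2)/2} \exp\left(-\frac{y}{2/(1 - 2\alpha(\theta))}\right).$$

The $\chi^2_\nu$ distribution is the gamma distribution with shape parameter $\nu/2$ and
scale parameter 2, so a negative value of $\alpha(\theta)$ decreases the scale parameter.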
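
As an illustration of this two-stage sampling, the following sketch draws $Y$ from the twisted gamma distribution and then the $Z_j$ given $Y$ with the parameters in (9.32). It assumes the caller supplies the value `alpha` of $\alpha(\theta)$ from (9.28), which is not reproduced here, along with the coefficients `b` and `lam`; the function name `sample_y_z` is hypothetical. It requires `alpha < 1/2` and $1 - 2\theta\lambda_j > 0$ for every $j$.

```python
import math
import random

def sample_y_z(theta, alpha, nu, b, lam, rng):
    """Draw (Y, Z_1, ..., Z_m) under the twisted measure for chi-square Y.

    alpha is the value of alpha(theta) from (9.28), passed in by the caller.
    """
    # Twisted chi-square(nu): gamma with shape nu/2 and scale 2/(1 - 2*alpha).
    y = rng.gammavariate(nu / 2.0, 2.0 / (1.0 - 2.0 * alpha))
    z = []
    for bj, lj in zip(b, lam):
        d = 1.0 - 2.0 * theta * lj                 # must be positive
        mean = theta * bj * math.sqrt(y / nu) / d  # mu_j(theta) in (9.32)
        std = math.sqrt(1.0 / d)                   # sigma_j(theta) in (9.32)
        z.append(rng.gauss(mean, std))
    return y, z
```

Python's `random.gammavariate` takes the shape and then the scale, which matches the parameterization used in the text.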


Importance Sampling Algorithm


The following algorithm summarizes the implementation of this importance 
sampling technique in estimating a loss probability P(L > x): 


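1. Find $\theta_x$ solving $\psi_x'(\theta_x) = 0$
2. For each of n replications

   (a) generate $Y$ from $f_{\theta_x}$
   (b) given $Y$, generate $Z$ from $N(\mu(\theta_x), \Sigma(\theta_x))$ with parameters given
       in (9.32)
   (c) compute $X_1, \ldots, X_m$ from $Y$ and $Z$
   (d) compute the price changes $\Delta S$ from $X$
   (e) calculate portfolio loss $L$ and approximation $Q_x$
   (f) evaluate

       $$e^{-\theta_x Q_x + \psi_x(\theta_x)}\,\mathbf{1}\{L > x\} \tag{9.33}$$

3. Calculate average of (9.33) over the n replications.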


It is not essential (or necessarily optimal) to take the $x$ in the definition
of $Q_x$ to be the same as the loss threshold in the probability $P(L > x)$. In
estimating $P(L > y)$ at several values of $y$ from a single set of replications,
we would need to use a single value of $x$. This value should be chosen near
the middle of the range of $y$ under consideration, or a bit smaller.
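
To make the reuse of a single set of replications concrete, here is a minimal sketch in a deliberately simple toy model, not the portfolio model of this section. We assume $Q = Z$ with $Z$ standard normal, $Q_x = Q - x$, and a loss $L = Z + 0.1\,W$ with $W$ an independent standard normal, so that $\psi_x(\theta) = \theta^2/2 - \theta x$ and $\theta_x = x$. All names are hypothetical.

```python
import math
import random
from statistics import NormalDist

def is_loss_probs(thresholds, x, n, rng):
    """Estimate P(L > y) for each y in thresholds from one set of twisted samples."""
    theta = x                               # root of psi'(theta) = theta - x
    psi = 0.5 * theta * theta - theta * x   # toy cumulant generating function
    sums = [0.0] * len(thresholds)
    for _ in range(n):
        z = rng.gauss(theta, 1.0)           # Z under the twisted measure
        w = rng.gauss(0.0, 1.0)             # W is unaffected by the twist
        q_x = z - x
        loss = z + 0.1 * w
        weight = math.exp(-theta * q_x + psi)   # likelihood ratio dP/dP_theta
        for k, y in enumerate(thresholds):
            if loss > y:
                sums[k] += weight
    return [s / n for s in sums]

def exact_tail(y):
    """Exact P(L > y) in the toy model, where L is normal with variance 1.01."""
    return 1.0 - NormalDist(0.0, math.sqrt(1.01)).cdf(y)
```

The same weights serve every threshold; only the indicator changes, which is why a single value of $x$ must be fixed in advance.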


Asymptotic Optimality 


GHS [144] establish an asymptotic optimality result for this approach applied 
to estimating P(Q > x). This should be viewed as indirect support for use of 
the method in the real problem of estimating P(L > x). The notion of asymp- 
totic optimality in this setting is appropriate to a heavy-tailed distribution 
and thus refers to a polynomial rather than exponential rate of decay. The 
precise statement of the result is a bit involved (see Theorem 5.1 of [144]), but 
it says roughly that $P(Q > x)$ is $O(x^{-\nu/2})$ and that the second moment of the
estimator (9.33) (with $L$ replaced by $Q$) is $O(x^{-\nu})$. Thus, the second moment
decreases at twice the rate of the probability itself. This result is relatively
insensitive to the choice of parameter $\theta$, as explained in GHS [144].


Stratified Sampling 


To further reduce the variance of the estimator (9.33), we stratify $Q_x$. The
procedure is nearly the same as the one in the light-tailed setting of
Section 9.2.3.

To define strata, we need the distribution of $Q_x$ (for fixed $x$) under $P_\theta$. We
get this distribution through numerical inversion of the characteristic function
of $Q_x$ under $P_\theta$. The moment generating function of $Q_x$ under $P_\theta$ is given by
$u \mapsto \phi_x(\theta + u)/\phi_x(\theta)$, with $\phi_x$ the moment generating function in (9.29),
and evaluating this function along the imaginary axis yields
the characteristic function.

Once we have defined strata for $Q_x$, we can use the acceptance-rejection
procedure of Section 9.2.3 to generate samples of $X$ conditional on the stratum
containing $Q_x$. From these we generate price changes $\Delta S$ and portfolio losses
$L$. Through these steps, we produce pairs $(Q_x, L)$ with $Q_x$ stratified.


Numerical Examples 


Table 9.2 shows variance reduction factors in estimating loss probabilities for 
four of the test cases reported in GHS [144]. Portfolios (A) and (B) are as 
described in Section 9.2.3, except that each $\Delta S_j$ is now a scaled $t$ random
variable,

$$\Delta S_j = \tilde\sigma_j \sqrt{\frac{\nu_j - 2}{\nu_j}}\; t_{\nu_j}.$$

For cases (A) and (B) in the table, all marginals have $\nu_j = 5$. Cases (A’) and
(B’) use the same portfolios of options but of the ten underlying assets, five
have $\nu_j = 3$ and five have $\nu_j = 7$. These are generated using the t-copula in
(9.22) starting from a reference value of $\nu = 5$.

In these examples, as in those of Section 9.2.3, we assume the portfolio 
value is given by applying the Black-Scholes formula to each option. We use an 
implied volatility of 0.30 in evaluating the Black-Scholes formula. The variance 
of $\Delta S_j$ is $\tilde\sigma_j^2$, which corresponds to an annual volatility of
$\tilde\sigma_j/(S_j\sqrt{\Delta t})$, given $S_j$. To equate this to the implied volatility,
we set $\tilde\sigma_j = 0.30\, S_j \sqrt{\Delta t}$. There
is still an evident inconsistency in applying the Black-Scholes formula with t- 
distributed price changes, but option pricing formulas are commonly used this 
way in practice. Also, the variance reduction techniques do not rely on special 
features of the Black-Scholes formula, so the results should be indicative of 
what would be obtained using more complex portfolio revaluation. 

The stratified results use 40 approximately equiprobable strata. The vari- 
ance ratios in the table are estimated from 40,000 replications. With stratified 
sampling, this allows 1000 replications per stratum. 


                                   Variance Ratios
Portfolio      x      P(L > x)      IS      IS-S
(A)          469         1%         46       134
(B)          617         1%         42       112
(A’)         475         1%         38        55
(B’)         671         1%         39        57

Table 9.2. Variance reduction factors in estimating loss probabilities using the
delta-gamma approximation for importance sampling (IS) and for importance
sampling with stratification (IS-S).
The results in Table 9.2 show even greater variance reduction in the
heavy-tailed setting than in the normal setting of Table 9.1. The results also suggest
that the methods are less effective when applied with the t-copula (9.22) than
when all marginals have the same parameter $\nu$. Both of these observations
are further supported by the more extensive numerical tests reported in GHS
[144]. The worst cases found in [144], resulting in little or no variance
reduction, combine the t-copula model of underlying assets with portfolios of digital
and barrier options. Results in [144] also indicate that the variance reduction
achieved in estimating a conditional excess $E[L \mid L > x]$ is usually about the
same as that achieved in estimating the loss probability $P(L > x)$.

Figure 9.6 (from [144]) gives a graphical illustration of the variance
reduction achieved with portfolio (A). The figure plots estimated loss probabilities
$P(L > x)$ for multiple values of $x$. The two outermost lines (solid) show
99% confidence intervals using standard simulation; these lines are formed
by connecting confidence limits computed at each value of $x$. The two
dotted lines show the tight intervals achieved by combining importance sampling
and stratified sampling. The results using the variance reduction techniques
were all estimated simultaneously: all use the same twisting parameter $\theta_x$, the
same strata, and the same values of $(Q_x, L)$, with $x \approx 400$. This is important
because in practice one would not want to generate separate scenarios for
different thresholds $x$.


[Figure 9.6: estimated loss probability P(L > x) plotted against x, for x from
200 to 600. Solid lines: Standard Simulation. Dotted lines: IS and Stratification.]

Fig. 9.6. Confidence intervals for loss probabilities for portfolio (A) using standard
simulation and importance sampling combined with stratification, from GHS [144].


The figure also illustrates how reducing variance in estimates of loss prob- 
abilities results in more precise quantile estimates. To estimate the 99% VAR, 
for example, we would read across from 0.01 on the vertical axis until we hit
the estimated curve $P(L > x)$, and then read down to find the
corresponding value of $x$. A rough indication of the uncertainty in the VAR estimate
is the space between the confidence interval curves at a height of 0.01. The
outer confidence bands (based on standard simulation) produce a wide
interval ranging roughly from $x = 375$ to 525, whereas the inner confidence bands
produce a tight interval for the quantile.
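
The read-across, read-down procedure amounts to inverting an estimated tail curve by interpolation. The sketch below assumes the curve is available as loss probabilities on a grid of thresholds and interpolates linearly between grid points; the function name and the numbers in the usage note are hypothetical and do not come from Figure 9.6.

```python
def quantile_from_tail_curve(xs, tail_probs, p):
    """Return the x at which the piecewise-linear curve through the points
    (xs[i], tail_probs[i]) crosses the level p.

    xs must be increasing and tail_probs nonincreasing.
    """
    for i in range(len(xs) - 1):
        q0, q1 = tail_probs[i], tail_probs[i + 1]
        if q0 >= p >= q1 and q0 != q1:
            # Linear interpolation between the two bracketing grid points.
            return xs[i] + (q0 - p) * (xs[i + 1] - xs[i]) / (q0 - q1)
    raise ValueError("level p is not bracketed by the curve")
```

For instance, with thresholds 400, 450, 500 and illustrative estimates 0.03, 0.012, 0.004, the level 0.01 is crossed at 462.5. Applying the same function to the lower and upper confidence curves gives the rough interval for the quantile described above.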


9.4 Credit Risk 


Sections 9.1-9.3 of this chapter deal with market risk — the risk of losses 
resulting from changes in market prices. Credit risk refers to losses resulting 
from the failure of an obligor (a party under a legal obligation) to make a 
contractual payment. Credit risk includes, for example, the possibility that a 
debtor will fail to repay a loan, a bond issuer will miss a coupon payment, or a 
counterparty to a swap will fail to make an interest payment. Market risk and 
credit risk interact, so the distinction between the two is not always sharp; 
but the models and methods used in the two settings differ in the features 
they emphasize. 

This section introduces some of the methods used to model credit risk, with 
emphasis on their implementation through simulation. The development of 
credit risk models remains an active area of research, and the field is currently 
spread wide over modeling variations too numerous to survey. Our goal is to 
illustrate some of the key features of models used in this area. 

Section 9.4.1 discusses models of the time to default for a single obligor. 
Such models are used in valuing corporate bonds and credit derivatives tied to 
a single source of credit risk. Section 9.4.2 discusses mechanisms for capturing
dependence between the default times of multiple obligors, loosely referred to
as “default correlation.” This is a central issue in measuring the credit risk
in a portfolio and in valuing credit derivatives tied to the creditworthiness of
multiple obligors. Section 9.4.3 deals with a simulation method for measuring
portfolio credit risk.


9.4.1 Default Times and Valuation 


The simplest setting in which to discuss credit risk is the valuation of a zero- 
coupon bond subject to possible default. The issuer of the bond is scheduled 
to make a fixed payment of 1 at a fixed time T. If the issuer goes into default 
prior to time T, no payment is made; otherwise the payment of 1 is made as 
scheduled. By letting $\tau$ denote the time of default, we can combine the two
cases by writing the obligor’s payment as $\mathbf{1}\{\tau > T\}$.

At the heart of most models of credit risk is a mechanism describing the
occurrence of default and thus the distribution of $\tau$. In referring to the
distribution of $\tau$, we should distinguish the distribution under a risk-neutral
probability measure from the distribution under the objective probability
measure describing the observed time to default. The risk-neutral distribu- 
tion is appropriate for valuation and the objective distribution is appropriate 
for measuring risk. Some models of default are used in both settings; we will 
not explicitly distinguish between the two and instead let context determine 
which is relevant. 

Given the (risk-neutral) distribution of $\tau$ and a short rate process $r(t)$, the
value of the bond payment $\mathbf{1}\{\tau > T\}$ is in principle determined as

$$E\left[e^{-\int_0^T r(t)\,dt}\,\mathbf{1}\{\tau > T\}\right]. \tag{9.34}$$

More generally, if default at time $\tau < T$ yields a cashflow of $X(\tau)$, then the
value of the bond is given by

$$E\left[e^{-\int_0^\tau r(t)\,dt}\, X(\tau)\,\mathbf{1}\{\tau \le T\}
+ e^{-\int_0^T r(t)\,dt}\,\mathbf{1}\{\tau > T\}\right],$$

assuming the event $\{\tau = T\}$ has zero probability. Expressions of this type
are sometimes used in reverse to find the distribution of the time to default
implied by the market prices of corporate bonds.


Default and Capital Structure 


A line of research that includes Merton [262], Black and Scholes [50], Leland 
[230], and much subsequent work values corporate bonds by starting from 
fundamental principles of corporate finance. This leads to a characterization 
of the default time $\tau$ as a first-passage time for the value of the issuing firm.

This approach posits a stochastic process for the value of a firm’s assets. 
The value of the firm’s assets equals the value of its equity plus the value of its 
debt. The limited liability feature of equity makes it an option on the firm’s 
assets: if the value of the assets becomes insufficient to cover the debt, the 
equity holders may walk away and surrender the firm to the bond holders. 

Valuing equity and debt thus reduces to valuing a type of barrier option 
in which the firm’s assets act as the underlying state variable and the barrier 
crossing triggers default. Different models make different assumptions about 
the dynamics of the firm’s value and about how the level of the barrier is de- 
termined; see Chapter 11 of Duffie [98] for examples and extensive references. 

In a sufficiently complex model, one might consider using simulation to 
determine the distribution of the default time. Importance sampling would 
then be potentially useful, especially if default is rare. Approximations of 
the type discussed in Section 6.4 can be used to reduce discretization error 
associated with the barrier crossing; see Caramellino and Iovino [75] for work 
on this topic. But looking at default from the perspective of capital structure 
and corporate finance has proved more successful as a conceptual framework 
than as a valuation tool. Simpler models are usually used in practice. We 
therefore omit further discussion of simulation in this setting. 
Default Intensity


A stochastic intensity for a default time $\tau$ is a process $\lambda(t)$ for which the
process

$$\mathbf{1}\{\tau \le t\} - \int_0^{t \wedge \tau} \lambda(u)\,du \tag{9.35}$$

is a martingale. For small $\Delta t$, interpret $\lambda(t)\Delta t$ as the conditional probability
of default in $(t, t + \Delta t)$ given information available at time $t$. This generalizes
the intensity of a Poisson process, for which $\lambda$ is deterministic.

To make precise the notion of available information and the martingale 
property of (9.35) we would need to specify a filtration. A default time $\tau$ may
admit different intensities with respect to different filtrations and may admit
no intensity with respect to some filtrations. This is not just a technical issue;
it is significant in modeling as well. For example, if $\tau$ is the time of a
barrier crossing for Brownian motion, then $\tau$ has no intensity with respect
to the history of the Brownian motion. This follows from the fact that (9.35)
does not have the form required by the martingale representation theorem
(Theorem B.3.2 in Appendix B.3). This also makes precise the idea that an
observer of the Brownian path could anticipate the barrier crossing. The same
$\tau$ could, however, admit an intensity with respect to a different filtration that
records imperfect information about the Brownian path; see Duffie and Lando
[102].

A consequence of the existence of an intensity is the identity 


$$P(\tau > T) = E\left[\exp\left(-\int_0^T \lambda(s)\,ds\right)\right]. \tag{9.36}$$


The bond-valuation formula (9.34) becomes


$$E\left[\exp\left(-\int_0^T [r(u) + \lambda(u)]\,du\right)\right]. \tag{9.37}$$


This is evident if $r$ and $\lambda$ are independent processes, but it holds more
generally if $r$ is adapted to the filtration with respect to which $\lambda$ is an intensity;
see Section 11.J of Duffie [98] and the references given there. As formulated
in [98] the result requires boundedness of $r$ and $\lambda$, but this condition can be
relaxed and is routinely ignored in applications. 

An appealing feature of (9.37) is that it values a defaultable bond the same 
way one would value a default-free bond, but with the discount rate increased 
from $r$ to $r + \lambda$. This is consistent with the market practice of pricing credit
risk by discounting at a higher rate. It is also consistent with the practice of 
interpreting yield spreads on corporate bonds as measures of their likelihood 
of default. 
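
As a small check of this interpretation, consider the special case in which both the short rate and the intensity are constants, which is an assumption made only for this illustration. Then $\tau$ is exponentially distributed with rate $\lambda$, (9.37) reduces to $e^{-(r + \lambda)T}$, and a direct simulation of (9.34) should agree with it. The function names below are hypothetical.

```python
import math
import random

def defaultable_zero_mc(r, lam, T, n, rng):
    """Monte Carlo estimate of (9.34) with constant short rate r and intensity lam."""
    discount = math.exp(-r * T)
    survived = 0
    for _ in range(n):
        tau = rng.expovariate(lam)  # default time for a constant intensity
        if tau > T:
            survived += 1
    return discount * survived / n

def defaultable_zero_closed_form(r, lam, T):
    """Formula (9.37) when r and lam are constants: discount at rate r + lam."""
    return math.exp(-(r + lam) * T)
```

With $r = 0.05$, $\lambda = 0.02$, and $T = 5$, both functions should return values close to $e^{-0.35}$.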

Because (9.37) is a valuation result discounting at a risk-free interest rate, 
it should be understood as an expectation under a risk-neutral measure. The 
relevant processes $r$ and $\lambda$ thus describe the risk-neutral dynamics of the
short rate and the default intensity. A counterpart of the Girsanov Theorem 
(see Section VI.2 of Brémaud [60]) shows that changing probability measures 
has a multiplicative effect on the intensity. The multiplicative factor may be 
interpreted as a risk premium for default risk. 

The valuation formula (9.37) assumes that in the event of default prior to 
T the bond becomes worthless. In practice, the holder of a bond in default 
usually recovers some fraction of the bond’s promised payments as a result 
of a complex legal process. Incorporating partial recovery in a model requires 
making simplifying assumptions about the amount received in the event of 
default. Duffie and Singleton [106] take the recovery to be a fraction of the 
value of the bond just prior to default; Jarrow and Turnbull [200] take it to 
be a fraction of the value of an otherwise identical default-free bond; Madan 
and Unal [244] use a random recovery rate. These and other assumptions are 
discussed in Bielecki and Rutkowski [46]. 

The approach of Duffie and Singleton [106] leads to a simple extension of 
(9.37). Suppose that the value of the bond just after default at time $\tau$ equals a
fraction $1 - L(\tau)$ of its value just before default. This time-dependent fraction
may be stochastic. Subject to regularity conditions, (9.37) generalizes to


$$E\left[\exp\left(-\int_0^T [r(u) + \lambda(u) L(u)]\,du\right)\right].$$


In this case, the spread between a defaultable and default-free bond reflects 
the loss given default as well as the probability of default. 


Intensity-Based Modeling 


Intensity-based modeling of default uses a stochastic intensity to model the 
time to default (rather than deriving the intensity from $\tau$). A key property
for this construction is the fact that the cumulative intensity to default,


$$\int_0^\tau \lambda(t)\,dt,$$


is exponentially distributed with mean 1. (This follows from Theorem II.16 of
Brémaud [60].) Given an arbitrary nonnegative process $\lambda$ and a random variable
$\xi$, exponentially distributed with mean 1 and independent of $\lambda$, we may define


$$\tau = \inf\left\{t \ge 0 : \int_0^t \lambda(u)\,du = \xi\right\} \tag{9.38}$$


(much as in (3.83)), and then $\tau$ has intensity $\lambda$. This construction is also useful
in simulating a default time from its intensity. 
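
A minimal sketch of this construction on a time grid follows. It assumes the intensity is available as a function of time along one simulated path and approximates the integral in (9.38) by a left-endpoint Riemann sum, so the returned time carries a discretization error of order the step size. The function name and arguments are hypothetical.

```python
import random

def default_time_from_intensity(intensity, xi, dt, horizon):
    """Return the first grid time at which the cumulated intensity reaches xi,
    or None if this does not happen before the horizon."""
    cumulative, t = 0.0, 0.0
    while t < horizon:
        cumulative += intensity(t) * dt  # left-endpoint Riemann sum
        t += dt
        if cumulative >= xi:
            return t
    return None

# Usage: draw the unit-mean exponential trigger, then scan one intensity path.
rng = random.Random(1)
xi = rng.expovariate(1.0)
tau = default_time_from_intensity(lambda t: 0.03 + 0.01 * t, xi, 0.01, 30.0)
```

Returning `None` when the horizon is reached keeps the case of no default before the horizon distinct from a default exactly at the horizon.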

A natural candidate for the process $\lambda$ is the square-root diffusion of
Section 3.4, because it is positive and mean-reverting. With this choice of
intensity, the tail of the distribution of $\tau$ in (9.36) has exactly the same form as
the bond pricing formula of Cox, Ingersoll, and Ross [91]; see Section 3.4.3.
Thus,

$$F_\tau(t) = P(\tau \le t) = 1 - e^{-A(0,t)\lambda(0) + C(0,t)}, \tag{9.39}$$

with $A$ and $C$ as defined in Section 3.4.3. To simulate $\tau$ it therefore suffices
to apply the inverse transform method of Section 2.2.1 to this distribution;
simulation of $\lambda$ itself is unnecessary. The same method applies, more generally,
when $\lambda$ belongs to the class of affine jump-diffusions defined in Duffie and Kan
[101]. For example, Duffie and Gârleanu [99] consider in detail the case of a 
square-root diffusion modified to take exponentially distributed jumps at the 
epochs of a Poisson process. 
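
The inverse transform step can be sketched numerically. The code below assumes the intensity follows $d\lambda = \alpha(b - \lambda)\,dt + \sigma\sqrt{\lambda}\,dW$ and uses the standard Cox-Ingersoll-Ross bond-price coefficients, which we take to match the $A$ and $C$ of Section 3.4.3; that correspondence, the parameter names, and the function names are assumptions of this sketch. The distribution function is inverted by bisection.

```python
import math

def cir_a_c(t, alpha, b, sigma):
    """Standard CIR bond-price coefficients over a horizon of length t."""
    gamma = math.sqrt(alpha * alpha + 2.0 * sigma * sigma)
    e = math.exp(gamma * t) - 1.0
    denom = (gamma + alpha) * e + 2.0 * gamma
    a = 2.0 * e / denom
    c = (2.0 * alpha * b / (sigma * sigma)) * math.log(
        2.0 * gamma * math.exp((alpha + gamma) * t / 2.0) / denom)
    return a, c

def default_cdf(t, lam0, alpha, b, sigma):
    """P(tau <= t) = 1 - exp(-A * lam0 + C), in the form of (9.39)."""
    a, c = cir_a_c(t, alpha, b, sigma)
    return 1.0 - math.exp(-a * lam0 + c)

def sample_default_time(u, lam0, alpha, b, sigma, tol=1e-10):
    """Inverse transform: solve default_cdf(t) = u for t by bisection, 0 < u < 1."""
    lo, hi = 0.0, 1.0
    while default_cdf(hi, lam0, alpha, b, sigma) < u:
        hi *= 2.0  # expand the bracket until it contains the solution
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if default_cdf(mid, lam0, alpha, b, sigma) < u:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Feeding a uniform random number into `sample_default_time` gives one default time. For values of `u` extremely close to 1 the bracket expansion can overflow the exponential, so a production version would cap `u` or work with logarithms.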

Given an expression like (9.39) for the distribution of a default time 7, 
there would be little reason to model the dynamics of the intensity if we were 
interested only in $\tau$. But in valuing credit derivatives or measuring credit risk
we are often interested in capturing dependence between default times and 
dependence between defaults and interest rates. One approach to this speci- 
fies correlated processes for interest rates and intensities. A multifactor affine 
process, as in Duffie and Singleton [106], provides a convenient framework in 
which to model these correlated processes. 


Ratings Transitions 


Corporate bonds and other securities with significant credit risk can lose value 
through credit events less severe than outright default. A change in credit 
rating, for example, can produce a change in the market price of a bond. 

A simple model of changes in credit ratings uses a Markov process 
to describe ratings transition. Consider, for example, a finite state space 
{0,1,..., N} in which each state represents a level of credit quality. States 
with higher indices correspond to higher credit quality; state 0 is an absorbing 
state representing default. Let $\{X(t), t \ge 0\}$ be a Markov process on this state
space with transition rate $q(i, j)$ from state $i$ to state $j$, for $i, j \in \{0, 1, \ldots, N\}$.
If default has not occurred by time t, then the default intensity at time t (with 
respect to the history of the process X) is q(X (t), 0), the transition rate from 
the current state X (t) to the default state 0. A model of this type is developed 
by Jarrow, Lando, and Turnbull [199]. 

One can simulate paths of X by simulating the sojourns in each state and 
the transitions between states. At the entry of $X$ to state $i \ne 0$, generate a
random variable $\xi$ exponentially distributed with mean $1/q(i)$, where


$$q(i) = \sum_{j \ne i} q(i, j);$$


this is the length of the sojourn in state $i$. After advancing the simulation
clock by $\xi$ time units, generate a transition out of state $i$ by choosing state $j$
with probability $q(i, j)/q(i)$, $j \ne i$.
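
The sojourn-and-jump procedure can be sketched as follows, assuming the transition rates are supplied as a nested list `q` with `q[i][j]` the rate from state `i` to state `j` (diagonal entries are ignored). The function name and the rate values in the usage line are hypothetical.

```python
import random

def simulate_rating_path(q, start, horizon, rng):
    """Simulate ratings transitions until default (state 0) or the horizon.

    Returns a list of (time, state) pairs, beginning with (0.0, start).
    """
    t, state = 0.0, start
    path = [(t, state)]
    while state != 0:
        rates = [q[state][j] if j != state else 0.0 for j in range(len(q))]
        total = sum(rates)             # q(i), the total rate out of the state
        if total <= 0.0:
            break                      # no exit from this state
        t += rng.expovariate(total)    # sojourn time with mean 1/q(i)
        if t >= horizon:
            break
        u = rng.random() * total       # choose j with probability q(i,j)/q(i)
        acc = 0.0
        for j, rate in enumerate(rates):
            acc += rate
            if u < acc:
                state = j
                break
        path.append((t, state))
    return path

# Usage with three states (0 = default) and illustrative rates.
q = [[0.0, 0.0, 0.0], [0.1, 0.0, 0.2], [0.01, 0.3, 0.0]]
path = simulate_rating_path(q, 2, 50.0, random.Random(1))
```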
9.4.2 Dependent Defaults


The models described in the previous section focus on a single default time. 
The credit risk in a portfolio and the value of some credit derivatives depend 
on the joint distribution of the default times of multiple obligors; capturing 
this dependence requires a mechanism for linking the marginal default times of 
the individual obligors. The likelihood of default for an individual obligor can 
often be at least partly determined from the credit rating and yield spreads 
on bonds it has issued, so capturing the dependence between obligors is the 
primary challenge in modeling credit risk. 


Using Equity Correlations 


Direct estimation of dependence between default times using historical data is 
difficult because defaults are rare. Whereas daily fluctuations in market prices 
provide a nearly continuous flow of information about market risk, the time 
to default for an investment-grade issue is measured in years. 

The link between default and capital structure discussed in Section 9.4.1 
provides a framework for translating information in equity prices to informa- 
tion about credit risk. Taken literally, this approach would require building 
a multivariate model of the dynamics of firm value for multiple obligors and 
capturing the dependence in their default times through dependence in first- 
passage times to multiple boundaries, each boundary determined by the cap- 
ital structure of one of the obligors. In practice, this idea is used in a much 
simpler way to build models of correlated defaults. 

As a first example of this, we discuss the method of Gupton, Finger, and 
Bhatia [163]. They model the occurrence or non-occurrence of default over a 
fixed horizon (a year, for example) rather than the time to default. To model m 
obligors they use a normally distributed random vector $(X_1, \ldots, X_m)$. Each
$X_i$ has a standard normal distribution, but the components are correlated.
The components are (loosely) interpreted as centered and scaled firm values
and the correlations are taken from the correlations in equity prices.

The $i$th obligor defaults within the fixed horizon if $X_i < x_i$, with $x_i$ a
constant. In analogy with Merton’s [262] setting, the threshold $x_i$ measures
the firm’s debt burden. Gupton et al. [163] assume a known default probability
$p_i$ for each obligor and set $x_i = \Phi^{-1}(p_i)$ so that


$$P(X_i < x_i) = p_i.$$


This calibrates the default indicator $Y_i = \mathbf{1}\{X_i < x_i\}$ for each obligor. The
joint distribution of the default indicators $Y_1, \ldots, Y_m$ is implicitly determined
by the joint distribution of $X_1, \ldots, X_m$. The transformation from the $X_i$ to
the $Y_i$ does not preserve linear correlations but, because it is monotone, it
does preserve rank correlations. 

The setting considered by Gupton et al. [163] is richer than what we have 
described in that it considers ratings transitions as well as default. For each 
obligor they partition the real line into intervals, each interval corresponding
to a credit rating (or default). Each interval is chosen so that its probability
(under the standard normal distribution) matches the probability that
the obligor will be in that ratings category at the end of the period. (These
probabilities are assumed known.) The intervals are defined using the same
steps used to define strata in Example 4.3.2. The correlations in the $X_i$ thus
introduce dependence in the ratings transitions of the various obligors.

The use of equity price correlations for the $X_i$ is not essential. As in Gupton
et al. [163], one could also model the $X_i$ using a representation of the form


$$X_i = a_{i1} Z_1 + \cdots + a_{ik} Z_k + b_i \epsilon_i, \quad i = 1, \ldots, m, \tag{9.40}$$


with $Z_1, \ldots, Z_k$ and $\epsilon_i$ independent $N(0,1)$ random variables and
$a_{i1}^2 + \cdots + a_{ik}^2 + b_i^2 = 1$. The $Z_j$ represent factors affecting multiple obligors whereas $\epsilon_i$
affects only the $i$th obligor. The common factors could, for example, represent
risks specific to an industry or geographical region or a market-wide factor
affecting all obligors.
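
The threshold mechanism combined with a factor representation can be sketched as follows. For simplicity the sketch assumes a single common factor, so that (9.40) becomes $X_i = a_i Z + \sqrt{1 - a_i^2}\,\epsilon_i$; the function names and the loadings and default probabilities in the usage note are hypothetical.

```python
import math
import random
from statistics import NormalDist

def default_thresholds(p):
    """Thresholds x_i = Phi^{-1}(p_i) for the given default probabilities."""
    std_normal = NormalDist()
    return [std_normal.inv_cdf(pi) for pi in p]

def simulate_default_indicators(thresholds, a, rng):
    """One scenario of default indicators Y_i = 1{X_i < x_i} with a single factor."""
    z = rng.gauss(0.0, 1.0)                      # common factor
    indicators = []
    for xi, ai in zip(thresholds, a):
        eps = rng.gauss(0.0, 1.0)                # idiosyncratic factor
        x = ai * z + math.sqrt(1.0 - ai * ai) * eps
        indicators.append(1 if x < xi else 0)
    return indicators
```

With default probabilities 0.05 and 0.10 and loadings of 0.5 on the common factor, repeated calls should reproduce the marginal default frequencies while producing joint defaults more often than independence would imply.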


Correlation and Intensities 


As mentioned in Section 9.4.1 and demonstrated in Duffie [98], affine jump- 
diffusion processes provide a convenient framework for modeling stochastic 
intensities. Through a multivariate process of this form, one can specify the 
joint dynamics of the intensities of multiple obligors together with the dynamics
of interest rates for various maturities while retaining some tractability.
To simulate default times in this framework, we can extend (9.38) and set


$$\tau_i = \inf\left\{t \ge 0 : \int_0^t \lambda_i(u)\,du = \xi_i\right\}, \quad i = 1, \ldots, m,$$


with $\lambda_i$ the default intensity for the $i$th obligor and $\xi_1, \ldots, \xi_m$ independent
unit-mean exponential random variables. Except in special cases, this would
require simulation of the paths of the intensities because the comparatively
simple expression in (9.39) for the marginal distribution of a default time does
not easily extend to sampling from the joint distribution of $\tau_1, \ldots, \tau_m$.

There are several ways of making the $\tau_i$ dependent in this framework. The
simplest mechanism uses correlated Brownian motions to drive the intensity
processes $\lambda_i$. This, however, introduces rather weak dependence among the
default times: if the intensities for two obligors are close, then their
instantaneous default probabilities are close but the default times themselves can be
far apart because of the independence of the $\xi_i$.

An alternative defines the default intensities to be overlapping sums of 
state variables. Consider a model driven by a d-dimensional state vector X (t) 
with nonnegative components and suppose the default intensity for obligor $i$
is given by

$$\lambda_i(t) = \sum_{j \in A_i} X_j(t),$$

with $A_i$ a subset of $\{1, \ldots, d\}$. Interpret each $X_j$ as an intensity process and,
using the mechanism in (9.38), let $T_j$ be the first arrival time generated by
this intensity. At $T_j$, obligor $i$ defaults if it has not previously defaulted and if
$j \in A_i$. In this formulation, the $X_j$ act like factors driving default and affecting
one or more obligors. A shortcoming of this approach is that it introduces
dependence by making defaults occur simultaneously.

A third approach introduces dependence through the exponential random 
variables $\xi_1, \ldots, \xi_m$ rather than through the intensities. This can provide
rather strong dependence between default times without requiring
simultaneous defaults. Moreover, if the intensities of the obligors are independent
and if the marginal distributions of the $\tau_i$ admit tractable expressions $F_i$, as
in (9.39), then we may avoid simulating the intensities by setting


$$\tau_i = F_i^{-1}(U_i), \quad U_i = 1 - \exp(-\xi_i). \tag{9.41}$$


Each $U_i$ is uniformly distributed on the unit interval, but $U_1, \ldots, U_m$ inherit
dependence from $\xi_1, \ldots, \xi_m$. This reduces the problem of specifying
dependence among the default times to one of specifying a multivariate exponential
or multivariate uniform distribution, which we turn to next.


Normal Copula 


A simple way to specify a multivariate uniform distribution starts from a 
vector (X1,...,Xm) of correlated N (0, 1) random variables and sets 


$$U_i = \Phi(X_i), \quad i = 1, \ldots, m. \tag{9.42}$$


The uniform random variables $U_1, \ldots, U_m$ can then be used to generate other
dependent random variables. For example, setting $\xi_i = -\log(1 - U_i)$,
$i = 1, \ldots, m$, makes $\xi_1, \ldots, \xi_m$ dependent exponential random variables.

This mechanism for introducing dependence is called a normal copula. It 
is implicit in the method of Gupton et al. [163] discussed above (set
$Y_i = \mathbf{1}\{U_i < p_i\}$) and similar to the t-copula used in Section 9.3. This construction
is convenient because it captures dependence through the correlation matrix
of a normal random vector, though as we noted above, the transformation
from the $X_i$ to other distributions does not in general preserve correlation.
For more on the correlation of the transformed variables, see Cario and Nelson 
[77], Embrechts et al. [112], Ghosh and Henderson [137], and Li [236]. Shaw 
[329], applying a method of Stein [337], combines a normal copula with Latin 
hypercube sampling in estimating value-at-risk. 


Other Copulas 


We have thus far referred to normal copulas and t-copulas; we now define 
copulas more generally. A copula is a function describing a multivariate dis- 
tribution in terms of its marginals. A copula function with m arguments is a 
distribution function $C$ on $[0,1]^m$ with uniform marginals. The requirement
of uniform marginals means that $C(u_1, \ldots, u_m) = u_i$ if $u_j = 1$ for all $j \ne i$;
the requirement that $C$ be a distribution on $[0,1]^m$ cannot be formulated so
succinctly. It implies, of course, that $C$ is increasing in each argument for
all values of the other arguments and that $C(0, \ldots, 0) = 0$, $C(1, \ldots, 1) = 1$,
though these properties do not suffice. See Nelsen [276] for a detailed discus- 
sion. 

Given univariate distributions $F_1, \ldots, F_m$, a copula function $C$ determines
a multivariate distribution on $\mathbb{R}^m$ with the $F_i$ as marginals through the
definition

$$F(x_1, \ldots, x_m) = C(F_1(x_1), \ldots, F_m(x_m)). \tag{9.43}$$

Conversely, given a multivariate distribution $F$ with marginals $F_1, \ldots, F_m$,
setting

$$C_F(u_1, \ldots, u_m) = F(F_1^{-1}(u_1), \ldots, F_m^{-1}(u_m)) \tag{9.44}$$

shows that every multivariate distribution admits the representation in (9.43).
Copulas therefore provide a natural framework for modeling dependence
between default times: the marginal distribution of the time to default for each
obligor is at least in part determined by information specific to that obligor;
the goal of much credit risk modeling is to link these marginal distributions 
into a multivariate distribution. 

The specific constructions based on the t distribution in (9.22) and based 
on the normal distribution in (9.42) are special cases of the general copula 
representation in (9.43). If $F$ is a multivariate normal distribution, $C_F$ in
(9.44) is a normal copula and if $F$ is a multivariate $t$ distribution, $C_F$ is
a t-copula. These copula functions can then be applied to other marginal 
distributions to link those marginals using the dependence structure of the 
multivariate normal or t. 

For example, the generalized random t vector in (9.22) has as its joint
distribution the function

\[
\begin{aligned}
P(\tilde X_1 \le x_1,\dots,\tilde X_m \le x_m)
&= P(F_\nu(X_1) \le F_{\nu_1}(x_1),\dots,F_\nu(X_m) \le F_{\nu_m}(x_m)) \\
&= C_{t,\nu}(F_{\nu_1}(x_1),\dots,F_{\nu_m}(x_m)),
\end{aligned}
\]

where we have written $C_{t,\nu}$ for the copula function of the multivariate t
distribution with $\nu$ degrees of freedom. Here and in (9.22) we apply a t-copula to
marginals that happen to be t distributions themselves, but this is not
essential. We could replace $C_{t,\nu}$ with a normal copula (as Hosking et al. [188] do),
and we could apply the t-copula to other marginal distributions. Mashal and
Zeevi [255], for example, apply a t-copula to empirical distributions of asset 
returns. 

A simple application of a normal copula can be used to generate dependent
default times with marginal distributions $F_1,\dots,F_m$. Let $X_1,\dots,X_m$ be
correlated $N(0,1)$ random variables and set

\[ \tau_i = F_i^{-1}(\Phi(X_i)), \quad i = 1,\dots,m. \tag{9.45} \]


This combines (9.41) and (9.42) and is among the methods discussed in Li 
[236]. One might look to the correlations in equity returns to determine the 
correlations between the X;, though it is less clear how to make this trans- 
lation here than in the fixed-horizon setting of Gupton et al. [163]. Finger 
[120] compares the dependence between default times achieved through four 
constructions (including (9.45)) calibrated to the same parameters and finds 
that different mechanisms can give significantly different results. 
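
The construction in (9.45) can be sketched in a few lines of Python. This is only an illustration under assumptions that are not part of the text: the marginals are taken to be exponential, $F_i(t) = 1 - e^{-\lambda_i t}$, the $X_i$ are linked through a single common factor with loading `rho`, and the function name `sample_default_times` is hypothetical.

```python
import math
import random

def sample_default_times(hazard_rates, rho, rng):
    """Draw (tau_1, ..., tau_m) with tau_i = F_i^{-1}(Phi(X_i)) as in (9.45).

    Illustrative assumptions: F_i(t) = 1 - exp(-lambda_i * t) and a single
    common factor, X_i = rho * Z + sqrt(1 - rho^2) * eps_i.
    """
    z = rng.gauss(0.0, 1.0)                      # common factor Z
    taus = []
    for lam in hazard_rates:
        eps = rng.gauss(0.0, 1.0)                # obligor-specific noise
        x = rho * z + math.sqrt(1.0 - rho * rho) * eps   # X_i ~ N(0,1)
        # 1 - Phi(x), computed with erfc so it stays positive far in the tail
        survival = 0.5 * math.erfc(x / math.sqrt(2.0))
        # F_i^{-1}(u) = -log(1 - u) / lambda_i with u = Phi(x)
        taus.append(-math.log(survival) / lam)
    return taus

rng = random.Random(12345)
print(sample_default_times([0.1, 0.5], rho=0.6, rng=rng))
```

Each $\tau_i$ keeps its exponential marginal whatever the value of `rho`; only the dependence between the default times changes.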

Various types of copulas are surveyed with a view towards risk manage- 
ment applications in Embrechts, McNeil, and Straumann [112], Hamilton, 
James, and Webber [167], and Li [236]. Schönbucher and Schubert [321]
analyze the combination of copulas with intensity-based modeling of marginal
defaults. Hamilton et al. [167] use historical default data to estimate a copula
empirically. In simulation, it is convenient to work with a copula associated
with a distribution from which it is easy to draw samples (like the normal
and t distributions). This facilitates the implementation of the type of
mechanism in (9.45). Marshall and Olkin [252] define an interesting class of copula
functions that also lend themselves to simulation. Schönbucher [320] compares
credit-loss distributions under various members of this class. 


9.4.3 Portfolio Credit Risk 


The most basic problem in measuring portfolio credit risk is determining the 
distribution of losses from default over a fixed horizon. This is the credit risk 
counterpart of the market risk problem considered in Sections 9.1-9.3. For 
credit risk, one usually considers a longer horizon — a year, for example. 
Consider a fixed portfolio exposed to the credit risk of m obligors. Let 


$Y_i$ = indicator that $i$th obligor defaults within the fixed horizon;
$p_i = P(Y_i = 1)$ = marginal probability of default of the $i$th obligor;
$c_i$ = loss resulting from default of $i$th obligor;
$L = \sum_{i=1}^m Y_i c_i$ = portfolio loss.

Estimating the distribution of the loss L often requires simulation, particularly 
in a model that captures dependence among the default indicators Y;. In 
estimating $P(L > x)$ at large values of $x$, it is natural to apply importance
sampling. This section describes some initial steps in this direction based on
joint work with Jingyi Li.


Independent Obligors


We begin by considering the simple case of independent obligors. The loss $c_i$
given default of the $i$th obligor could be modeled as a random variable but to
further simplify, we take it to be a constant. These simplifications make $L$ a
sum of independent random variables with moment generating function

\[ E[e^{\theta L}] = \prod_{i=1}^m E[e^{\theta Y_i c_i}] = \prod_{i=1}^m \left(p_i e^{c_i \theta} + (1 - p_i)\right), \]

finite for all $\theta \in \mathbb{R}$. Replacing $\theta$ with a purely imaginary argument yields the
characteristic function of L, which can be inverted numerically to find the 
distribution of L. An alternative to exact inversion is a saddlepoint approxi- 
mation, as applied in Martin, Thompson, and Browne [253]. 

Although simulation is unnecessary in this setting, it provides a convenient 
starting point for developing variance reduction techniques. To simplify the 
problem even further, suppose all obligors have the same default probabilities
$p_i = p$ and exposures $c_i = 1$. Consider the effect of exponentially twisting
$Y_i$, in the sense we encountered in Example 4.6.2 and Section 9.2.2. In other
words, define a change of probability measure through the likelihood ratio

\[ \exp(\theta Y_i - \psi(\theta)) \]

with

\[ \psi(\theta) = \log\left(p e^{\theta} + (1 - p)\right) \]

and $\theta \in \mathbb{R}$ a parameter. The default probability under the new distribution
is the probability that $Y_i = 1$ and this is also the mean of $Y_i$ under the new
distribution. Using a standard property of exponential twisting, also used in
Examples 4.6.2-4.6.4 and (9.16), this mean is given by

\[ p(\theta) = \psi'(\theta) = \frac{p e^{\theta}}{p e^{\theta} + (1 - p)}. \]

By choosing $\theta > 0$, we thus increase the probability of default.
Now apply this exponential twist to all the $Y_i$. Because the $Y_i$ are
independent, the resulting likelihood ratio is the product of the individual
likelihood ratios and is thus given by

\[ \prod_{i=1}^m \exp(\theta Y_i - \psi(\theta)) = \exp(\theta L - m\psi(\theta)). \]

In other words, in this simple setting, exponentially twisting every $Y_i$ defines
the same change of measure as exponentially twisting $L$ itself. Write $P_\theta$ for
the new probability measure so defined and $E_\theta$ for expectation under this
measure. This provides the representation

\[ P(L > x) = E_\theta\left[ e^{-\theta L + m\psi(\theta)} \mathbf{1}\{L > x\} \right] \tag{9.46} \]

for the loss probability. The expression inside the expectation on the right
provides an unbiased estimator of the loss probability if sampled under the new
probability measure. Sampling under $P_\theta$ is easy: it simply involves replacing
the original default probability $p$ with $p(\theta)$.

It remains to choose the parameter $\theta$. The argument leading to (9.15),
based on minimizing an upper bound on the second moment, similarly leads
here to the value $\theta_x$ solving

\[ \psi'(\theta_x) = x/m, \]

which then also satisfies

\[ E_{\theta_x}[L] = E_{\theta_x}\Big[\sum_{i=1}^m Y_i\Big] = m\psi'(\theta_x) = x. \tag{9.47} \]


Thus, to estimate P(L > x) for large x, we increase the individual default 
probabilities to make x the expected loss. It follows from more general results 
on importance sampling (as in Bucklew, Ney, and Sadowsky [72] and Sadowsky 
[315]) that this method is asymptotically optimal as x and m increase in a 
fixed proportion. 
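
As a rough illustration of (9.46)-(9.47) in the homogeneous case, the sketch below solves $\psi'(\theta_x) = x/m$ in closed form, samples defaults with the twisted probability $p(\theta_x) = x/m$, and compares the importance sampling estimate with the exact binomial tail. The function names and the parameter values in the usage lines are illustrative choices, not taken from the text, and the code assumes $mp < x < m$.

```python
import math
import random

def twisted_tail_estimate(m, p, x, n_reps, rng):
    """Importance sampling estimate of P(L > x) for L ~ Binomial(m, p).

    Uses the twist theta_x with m * p(theta_x) = x, as in (9.47).
    Assumes m * p < x < m so that theta_x > 0 is well defined.
    """
    q = x / m                                        # twisted probability p(theta_x)
    theta = math.log(q * (1.0 - p) / (p * (1.0 - q)))  # solves psi'(theta) = x/m
    psi = math.log(p * math.exp(theta) + 1.0 - p)      # psi(theta_x)
    total = 0.0
    for _ in range(n_reps):
        # number of defaults sampled under the twisted measure
        loss = sum(rng.random() < q for _ in range(m))
        if loss > x:
            total += math.exp(-theta * loss + m * psi)  # likelihood ratio in (9.46)
    return total / n_reps

def exact_tail(m, p, x):
    """Exact P(L > x) for L ~ Binomial(m, p), for comparison."""
    return sum(math.comb(m, k) * p**k * (1.0 - p)**(m - k)
               for k in range(x + 1, m + 1))

rng = random.Random(2024)
print(twisted_tail_estimate(100, 0.05, 15, 5000, rng), exact_tail(100, 0.05, 15))
```

Under the twisted measure roughly half of the replications land in the event $\{L > x\}$, whereas ordinary Monte Carlo would almost never observe it at this sample size.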

This approach extends easily to the case of unequal $p_i$ and $c_i$. Set

\[ \psi_i(\theta) = \log\left(p_i e^{c_i \theta} + (1 - p_i)\right), \]

and $\psi_L = \psi_1 + \cdots + \psi_m$. Then $\psi_L$ is the cumulant generating function of $L$
and exponentially twisting $L$ defines the change of measure

\[ \frac{dP_\theta}{dP} = \exp(\theta L - \psi_L(\theta)). \]

This has the same effect as exponentially twisting every $Y_i c_i$. Under $P_\theta$, each
term $Y_i c_i$ has mean $\psi_i'(\theta)$, which is to say that the $i$th obligor defaults with
probability $p_i(\theta) = \psi_i'(\theta)/c_i$. This can also be written as

\[ \frac{p_i(\theta)}{1 - p_i(\theta)} = \left(\frac{p_i}{1 - p_i}\right) e^{c_i \theta}, \]

which shows that taking $\theta > 0$ increases the odds ratio for every obligor, with
larger increases for obligors with larger exposures $c_i$. Choosing $\theta$ as the root of
the equation $\psi_L'(\theta) = x$ again satisfies (9.47) and minimizes an upper bound
on the second moment of the estimator. 
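
With unequal $p_i$ and $c_i$ the root of $\psi_L'(\theta) = x$ is no longer available in closed form, but $\psi_L'$ is increasing, so a simple bracketing search works. The following is a minimal sketch using bisection; the function names are hypothetical, and it assumes $\sum_i c_i p_i < x < \sum_i c_i$ so that a positive root exists.

```python
import math

def twisted_prob(p_i, c_i, theta):
    """p_i(theta) = p_i e^{c_i theta} / (p_i e^{c_i theta} + 1 - p_i)."""
    w = p_i * math.exp(c_i * theta)
    return w / (w + 1.0 - p_i)

def psi_prime(theta, p, c):
    """psi_L'(theta) = sum_i c_i p_i(theta), the mean loss under P_theta."""
    return sum(ci * twisted_prob(pi, ci, theta) for pi, ci in zip(p, c))

def solve_theta(x, p, c, tol=1e-12):
    """Bisection for psi_L'(theta) = x, assuming sum c_i p_i < x < sum c_i."""
    lo, hi = 0.0, 1.0
    while psi_prime(hi, p, c) < x:     # expand the bracket until it contains the root
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if psi_prime(mid, p, c) < x:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

p = [0.01, 0.02, 0.05]
c = [1.0, 2.0, 5.0]
theta = solve_theta(4.0, p, c)
print(theta, [twisted_prob(pi, ci, theta) for pi, ci in zip(p, c)])
```

The printed probabilities show the effect described above: the obligor with the largest exposure receives the largest increase in its odds of default.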


Dependent Defaults 


We now turn to the more interesting case in which the Y; are dependent. Differ- 
ent models of dependence entail different approaches to importance sampling; 
we consider dependence introduced through a normal copula as discussed in 
Section 9.4.2. 

For each obligor $i$, let

\[ x_i = \Phi^{-1}(1 - p_i) \tag{9.48} \]

so that the probability of default $p_i$ equals the probability that a standard
normal random variable exceeds $x_i$. Construct default indicators by setting
$Y_i = \mathbf{1}\{X_i > x_i\}$ with $X_i \sim N(0,1)$. (We are free to choose the threshold $x_i$
in the lower tail of the normal as in Section 9.4.2 or the upper tail as we do
here.) Link the random variables $X_1,\dots,X_m$ through a specification of the
form (9.40); in vector notation, this becomes

\[ X = AZ + B\epsilon, \tag{9.49} \]

in which $\epsilon$ has the $N(0,I)$ distribution in $\mathbb{R}^m$, $Z$ has the $N(0,I)$ distribution in
$\mathbb{R}^k$, and $Z$ is independent of $\epsilon$. The $m \times k$ matrix $A$ determines the correlation
matrix of $X$, whose off-diagonal entries are given by those of $AA^\top$. The $m \times m$
diagonal matrix $B$ is chosen so that all diagonal entries of the covariance
matrix $AA^\top + B^2$ equal 1, thus making the components of $X$ standard normal
random variables. We think of the dimension $k$ of the common factors as
substantially smaller than the number of obligors; for example, $m$ could be in
the thousands and $k$ as small as 1-10.

We apply importance sampling conditional on the common factors. Given
$Z$, the vector $X$ is normally distributed with mean $AZ$ and diagonal covariance
matrix $B^2$. The conditional probability of default of the $i$th obligor is therefore

\[ \tilde p_i = P(Y_i = 1 \mid Z) = P(X_i > x_i \mid Z) = 1 - \Phi\left(\frac{x_i - a_i Z}{b_i}\right), \tag{9.50} \]

where $a_i$ is the $i$th row of $A$ and $b_i$ is the $i$th element on the diagonal of
$B$. Given $Z$, the portfolio loss $L$ becomes a sum of $m$ independent random
variables, the $i$th summand taking the values $c_i$ and 0 with probabilities $\tilde p_i$
and $1 - \tilde p_i$.

Conditioning on $Z$ thus reduces the problem to the independent case with
which we began this section. Define

\[ \psi_{L|Z}(\theta) = \log E[e^{\theta L} \mid Z] = \sum_{i=1}^m \log\left(\tilde p_i e^{c_i \theta} + (1 - \tilde p_i)\right); \tag{9.51} \]

this is the cumulant generating function of the distribution of $L$ given $Z$. Let
$\theta_x$ solve

\[ \psi_{L|Z}'(\theta_x) = x. \]

Define the conditional default probabilities

\[ \tilde p_i(\theta_x) = \frac{\tilde p_i e^{c_i \theta_x}}{\tilde p_i e^{c_i \theta_x} + (1 - \tilde p_i)}, \quad i = 1,\dots,m. \tag{9.52} \]

Given $Z$, the default indicators $Y_1,\dots,Y_m$ are independent and, under the
$\theta_x$-twisted distribution, $Y_i$ takes the value 1 with probability $\tilde p_i(\theta_x)$; these are
therefore easy to generate. Setting $L$ equal to the sum of the $Y_i c_i$ yields the
estimator

\[ e^{-\theta_x L + \psi_{L|Z}(\theta_x)} \mathbf{1}\{L > x\}; \tag{9.53} \]

this is the conditional counterpart of the expression in (9.46). Its conditional
expectation is $P(L > x \mid Z)$ and its unconditional expectation is therefore
$P(L > x)$.


Factor Twisting 


We can apply further importance sampling to the normally distributed factors 
$Z$. To see why this might be useful, decompose the variance of the estimator
(9.53) as

\[ E\left[\mathrm{Var}_{\theta_x}\left[ e^{-\theta_x L + \psi_{L|Z}(\theta_x)} \mathbf{1}\{L > x\} \mid Z \right]\right] + \mathrm{Var}\left[P(L > x \mid Z)\right]. \tag{9.54} \]

The second term is the variance of the conditional expectation of the estimator
given $Z$. Twisting the default indicators conditional on $Z$, as in (9.53), makes
the first term in this decomposition small but does nothing for the second
term. Because $P(L > x \mid Z)$ is a function of $Z$, we can apply importance
sampling to $Z$ to try to reduce the contribution of the second term to the
total variance.

The simplest application of importance sampling to an $N(0,I)$ random
vector introduces a mean vector $\mu$, as in Example 4.6.1 and Section 4.6.2.
The likelihood ratio for this change of measure is

\[ \exp\left(-\mu^\top Z + \tfrac{1}{2}\mu^\top \mu\right). \]

When multiplied by (9.53), this yields the estimator

\[ \exp\left(-\mu^\top Z + \tfrac{1}{2}\mu^\top \mu - \theta_x L + \psi_{L|Z}(\theta_x)\right) \mathbf{1}\{L > x\}, \tag{9.55} \]

in which $Z$ is sampled from $N(\mu, I)$ and then $L$ is sampled from the $\theta_x$-twisted
distribution conditional on $Z$.


Importance Sampling Algorithm for Credit Losses 


We now summarize the two-step importance sampling procedure in a single 
algorithm. We assume that the matrices A and B in the factor representation 
(9.49) and the thresholds x; in (9.48) have already been defined. We also 
assume that a mean vector $\mu$ for the common factors has been selected; we
return to this point below.


1. Generate $Z \sim N(\mu, I)$ in $\mathbb{R}^k$
2. Compute conditional default probabilities $\tilde p_i$, $i = 1,\dots,m$, as in (9.50)
3. Define $\psi_{L|Z}$ as in (9.51) and find $\theta$ solving $\psi_{L|Z}'(\theta) = x$; set $\theta_x = \max\{0, \theta\}$
4. Compute twisted conditional default probabilities $\tilde p_i(\theta_x)$, $i = 1,\dots,m$,
   using (9.52)
5. For $i = 1,\dots,m$, let $Y_i = 1$ with probability $\tilde p_i(\theta_x)$ and $Y_i = 0$
   otherwise. Calculate loss $L = c_1 Y_1 + \cdots + c_m Y_m$
6. Return estimator (9.55).


These steps compute a single replication of the estimator (9.55), which is 
an unbiased estimator of P(L > x). The steps can be repeated and the results 
averaged over independent replications. 

At some values of $Z$ (large values, if all the coefficients in (9.40) are
positive), the conditional default probabilities $\tilde p_i$ may be sufficiently large that
the expected loss given $Z$ exceeds $x$; i.e., $\psi_{L|Z}'(0) > x$. In this case, the $\theta$
calculated in Step 3 of the algorithm would be negative, and twisting by this
parameter would decrease the default probabilities. To avoid this, Step 3 sets
$\theta_x = 0$ if the root $\theta$ is negative.
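
The six steps can be put together as follows. This is a sketch rather than a reference implementation: the function names, the list-of-rows layout for $A$, the use of bisection in Step 3, and the parameter values in the usage lines are all illustrative assumptions. It also assumes $x < \sum_i c_i$ so that the bracket search in Step 3 terminates.

```python
import math
import random
from statistics import NormalDist

def default_threshold(p_i):
    """(9.48): x_i = Phi^{-1}(1 - p_i)."""
    return NormalDist().inv_cdf(1.0 - p_i)

def default_prob_given_z(x_i, a_i, b_i, z):
    """(9.50): 1 - Phi((x_i - a_i Z) / b_i), via erfc for accuracy in the tail."""
    d = (x_i - sum(a * zj for a, zj in zip(a_i, z))) / b_i
    return 0.5 * math.erfc(d / math.sqrt(2.0))

def twist(p_i, c_i, theta):
    """(9.52), arranged to avoid overflow when theta is large."""
    if p_i == 0.0:
        return 0.0
    return 1.0 / (1.0 + (1.0 - p_i) / p_i * math.exp(-c_i * theta))

def replicate(A, b, thresholds, c, x, mu, rng):
    """One replication of estimator (9.55)."""
    z = [rng.gauss(m_j, 1.0) for m_j in mu]                        # Step 1
    p = [default_prob_given_z(xi, ai, bi, z)
         for xi, ai, bi in zip(thresholds, A, b)]                  # Step 2

    def mean_loss(th):                                             # psi'_{L|Z}(th)
        return sum(ci * twist(pi, ci, th) for pi, ci in zip(p, c))

    theta = 0.0                                                    # Step 3
    if mean_loss(0.0) < x:              # otherwise the root is <= 0: keep theta_x = 0
        lo, hi = 0.0, 1.0
        while mean_loss(hi) < x:
            hi *= 2.0
        for _ in range(40):             # bisection on the increasing function psi'
            mid = 0.5 * (lo + hi)
            if mean_loss(mid) < x:
                lo = mid
            else:
                hi = mid
        theta = 0.5 * (lo + hi)
    # psi_{L|Z}(theta_x) from (9.51), using log(1 + p (e^{c theta} - 1))
    psi = sum(math.log1p(pi * math.expm1(ci * theta)) for pi, ci in zip(p, c))
    q = [twist(pi, ci, theta) for pi, ci in zip(p, c)]             # Step 4
    loss = sum(ci for qi, ci in zip(q, c) if rng.random() < qi)    # Step 5
    if loss <= x:                                                  # Step 6
        return 0.0
    mu_z = sum(m_j * z_j for m_j, z_j in zip(mu, z))
    mu_mu = sum(m_j * m_j for m_j in mu)
    return math.exp(-mu_z + 0.5 * mu_mu - theta * loss + psi)

# Illustrative single-factor portfolio: 20 obligors, p = 2%, loading 0.3.
m, rho = 20, 0.3
A = [[rho]] * m
b = [math.sqrt(1.0 - rho * rho)] * m
thresholds = [default_threshold(0.02)] * m
rng = random.Random(1)
n = 2000
print(sum(replicate(A, b, thresholds, [1.0] * m, 2, [1.5], rng) for _ in range(n)) / n)
```

Because $\theta_x$ depends only on $Z$, truncating it at zero leaves the estimator unbiased.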


Choice of Factor Mean 


It remains to specify the new mean $\mu$ for the common factors $Z$. As noted in
the discussion surrounding (9.54), the conditional twist (9.53) reduces variance
in estimating $P(L > x \mid Z)$ and the purpose of applying importance sampling
to $Z$ is to reduce variance in estimating the expectation of $P(L > x \mid Z)$, viewed
as a function of $Z$. We should therefore choose $\mu$ to minimize this variance or,
equivalently, the second moment

\[ E_\mu\left[ e^{-2\mu^\top Z + \mu^\top \mu} P(L > x \mid Z)^2 \right]. \tag{9.56} \]

We have subscripted the expectation by $\mu$ to emphasize that it is computed
with $Z$ having distribution $N(\mu, I)$.

At this point, we restrict attention to a very special case. We take all the
default probabilities $p_i$ to be equal and we take all $c_i = 1$. This makes $L$ the
number of defaults. We take $Z$ to be scalar and all $X_i$ in (9.49) of the form

\[ X_i = \rho Z + \sqrt{1 - \rho^2}\,\epsilon_i, \]

with the same scalar $\rho$ for all $i = 1,\dots,m$.

At $\rho = 0$, the default indicators $Y_i = \mathbf{1}\{X_i > x_i\}$ become independent
and $P(L > x \mid Z)$ equals $P(L > x)$. Because $Z$ drops out of the problem, the
optimal choice of $\mu$ in this case is simply $\mu = 0$. At $\rho = 1$, all the default
indicators become identical and $L$ takes only the values 0 and $m$. Assuming
$0 < x < m$, we thus have

\[ P(L > x \mid Z) = P(Y_i = 1 \mid Z) = \mathbf{1}\{Z > x_i\} \]

for any $i = 1,\dots,m$. If we make this substitution in (9.56) and follow the same
steps leading to (9.15)-(9.16), we arrive at the parameter value satisfying

\[ E_\mu[Z] = x_i; \tag{9.57} \]

that is, $\mu = x_i$. The default threshold $x_i$ equals $\Phi^{-1}(1 - p)$ where $p$ is the
common value of the default probabilities $p_i$. This argument therefore suggests
that $\mu$ should increase from 0 to $\Phi^{-1}(1 - p)$ as $\rho$ increases from 0 to 1.

The numerical results in Figure 9.7 support this idea. The results apply
to a portfolio with $m = 1000$ obligors, $c_i = 1$ and $p_i = p = 0.002$. The figure
plots variance reduction factors (relative to ordinary Monte Carlo) for the
estimator (9.55) as a function of $\mu$ for $\rho$ = 0.05, 0.10, 0.25, 0.50, and 0.80. Each
curve corresponds to a value of $\rho$. For each $\rho$, we choose $x$ to make $P(L > x)$
close to 1%, though this is not always possible because $L$ is integer-valued.
The values of $x$ and the loss probabilities are as follows:


$\rho$:        0.05   0.10   0.25   0.50   0.80
$x$:           6      6      10     26     43
$P(L > x)$:    0.6%   1.0%   0.9%   0.9%   1.0%


The curves in Figure 9.7 indeed show that the optimal $\mu$ (the point at
which the variance reduction is greatest) tends to increase with $\rho$. Moreover,
a default probability of $p = 0.002$ corresponds to a threshold of
$x_i = \Phi^{-1}(1 - 0.002) = 2.88$, and the argument leading to (9.57) asserts that this should be
close to the optimal $\mu$ for $\rho$ close to 1. (Equation (9.57) defines the optimal $\mu$
for $\rho = 1$ as $x_i \to \infty$, but for finite $x_i$ the optimal $\mu$ may not be exactly equal
to $x_i$, even at $\rho = 1$.) The curves in the figure indicate that the limiting value
of 2.88 is close to optimal for $\rho = 0.50$ and $\rho = 0.80$, and is very effective
even at $\rho = 0.25$. We know that at $\rho = 0$ it would be optimal to take $\mu = 0$,
but the figure shows that with a small increase in $\rho$ to 0.05 there is already
substantial benefit to taking $\mu > 0$. For the examples in the figure, we find
variance reduction factors in the range of 30-50 with $\mu$ chosen optimally for
loss probabilities near 1%. Greater variance reduction would be achieved at
smaller loss probabilities.

This example is simple enough that it does not require simulation — L has 
a binomial distribution conditional on Z, so calculation of P(L > x) reduces 
to a one-dimensional integral. It nevertheless shows the potential effectiveness 
of importance sampling in estimating the tail of the loss distribution. The 
dependence mechanisms used in credit risk models in turn pose interesting 
new challenges for research on variance reduction. 
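
The one-dimensional integral mentioned above can be evaluated directly, which gives a convenient benchmark for the simulation estimators. The sketch below integrates the conditional binomial tail against the standard normal density on a uniform grid; the grid limits, the number of grid points, and the function names are illustrative assumptions, and the code requires $\rho < 1$.

```python
import math
from statistics import NormalDist

def binomial_tail(m, q, x):
    """P(Bin(m, q) > x), computed as one minus the lower part of the pmf."""
    lower = sum(math.comb(m, k) * q**k * (1.0 - q)**(m - k) for k in range(x + 1))
    return max(0.0, 1.0 - lower)

def one_factor_loss_tail(m, p, rho, x, z_max=8.0, n_grid=3201):
    """P(L > x) when X_i = rho*Z + sqrt(1 - rho^2)*eps_i, c_i = 1, p_i = p.

    Given Z = z, L is Binomial(m, q(z)) with q(z) from (9.50); the outer
    integral over z is a Riemann sum on a uniform grid in [-z_max, z_max].
    """
    std = NormalDist()
    threshold = std.inv_cdf(1.0 - p)          # x_i in (9.48)
    b = math.sqrt(1.0 - rho * rho)
    dz = 2.0 * z_max / (n_grid - 1)
    total = 0.0
    for j in range(n_grid):
        z = -z_max + j * dz
        q = 0.5 * math.erfc((threshold - rho * z) / (b * math.sqrt(2.0)))
        total += binomial_tail(m, q, x) * std.pdf(z) * dz
    return total

print(one_factor_loss_tail(1000, 0.002, 0.25, 10))
```

At $\rho = 0$ the conditional default probability no longer depends on $z$ and the routine reduces to the ordinary binomial tail, which is a simple consistency check.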


Fig. 9.7. Variance reduction achieved through importance sampling as a function
of factor mean $\mu$ for various levels of $\rho$. The optimal $\mu$ increases with $\rho$ and is close
to $\Phi^{-1}(1 - p) = 2.88$ at larger values of $\rho$.


9.5 Concluding Remarks


In this chapter, we have presented some applications of Monte Carlo
simulation to risk management. In our discussion of market risk, we have focused
on the problem of estimating loss probabilities and value-at-risk and detailed
the use of the delta-gamma approximation as a basis for variance reduction.
In discussing credit risk, we have described some of the main modeling
approaches and simulation issues, and described some initial steps in research
on efficient simulation for portfolio credit risk.

We have not attempted a comprehensive treatment of the use of simulation 
in risk management — the topic is too broad to permit that here. Simulation 
is widely used in areas of risk management not even touched on in this chapter 
— pension planning and insurance, for example. This section provides some 
additional references to relevant methods and applications. 

In our discussion of market risk, we have focused on portfolios with assets 
whose value must be computed rather than simply observed. This is appro- 
priate for a portfolio of options but overly complicated for, e.g., a portfolio of 
stocks. When time series of asset values are available, extreme value theory is 
useful for quantile estimation. See Bassi, Embrechts, and Kafetzaki [39] for an 
introduction to risk management applications and Embrechts, Klüppelberg,
and Mikosch [111] for the underlying theory. 

Quasi-Monte Carlo is a natural tool to consider in calculating risk mea- 
sures. Methods for improving uniformity are not, however, specifically suited 
to estimating small probabilities or extreme quantiles. 

Talay and Zheng [344] analyze the discretization error in using an Euler 
approximation to estimate quantiles of the law of a diffusion, with applications 
to value-at-risk. They show that the discretization error (like the variance in 
the central limit theorem (9.7)) involves the reciprocal of the density at the 
quantile and can therefore be very large at extreme quantiles. 

Importance sampling for heavy-tailed distributions is an active area of 
current research, motivated by applications in telecommunications and in
insurance risk theory. Work on this topic includes Asmussen and Binswanger
[22], Asmussen, Binswanger, and Højgaard [23], and Juneja and Shahabuddin 
[206]. These papers address difficulties in extending importance sampling for 
the classical ruin problem of Example 4.6.3 to the case of heavy-tailed claims. 

At the interface of market and credit risk lies the problem of calculating 
the evolution of credit exposures. The exposure is the amount that would be 
lost if a counterparty defaulted and is thus the positive part of the net market 
value of all contracts with that counterparty, irrespective of the probability 
of default. The exposure in an interest rate swap, for example, is zero at 
inception and at termination and reaches a maximum around a third or half 
of the way into the life of the swap. Simulation is useful in estimating the path 
of the mean exposure and a quantile of the exposure; see Wakeman [354], for 
example. 

Dynamic financial analysis refers to simulation-based techniques for risk 
management of growing use in the insurance industry. These are primarily 
long-term simulations of interest rates and other financial variables coupled 
with insurance losses. See Kaufmann, Gadmer, and Klett [208] for an intro- 
duction. 

As noted in Section 9.4.1, interest rates and default intensities play for- 
mally similar roles in the calculation of bond prices and survival probabilities. 
This analogy leads to HJM-like models of intensities, as presented in Chap- 
ters 13 and 14 of Bielecki and Rutkowski [46]. The simulation methods of 
Sections 3.6 and 3.7 are potentially relevant to these models. 

For more on credit risk modeling and the valuation of credit derivatives, 
see the books by Bielecki and Rutkowski [46] and Duffie [98]. 


A
Appendix: Convergence and Confidence Intervals


This appendix summarizes basic convergence concepts and the application 
of the central limit theorem to the construction of confidence intervals. The 
results and definitions reviewed in this appendix are covered in greater detail 
in many textbooks on probability and statistics. 


A.1 Convergence Concepts 


Random variables $\{X_n, n = 1, 2, \dots\}$ on a probability space $(\Omega, \mathcal{F}, P)$ converge
almost surely (i.e., with probability 1) to a random variable $X$ if

\[ P\left( \lim_{n \to \infty} X_n = X \right) = 1, \]

meaning, more explicitly, that the set

\[ \{\omega \in \Omega : \lim_{n \to \infty} X_n(\omega) = X(\omega)\} \]

has $P$-probability 1. The convergence holds in probability if, for all $\epsilon > 0$,

\[ P(|X_n - X| > \epsilon) \to 0 \]

as $n \to \infty$. It holds in $p$-norm, $0 < p < \infty$, if all $X_n$ and $X$ have finite $p$th
moment and

\[ E[|X_n - X|^p] \to 0. \]

Almost sure convergence implies convergence in probability. Convergence 
in probability implies the existence of a deterministic subsequence
$\{n_k, k = 1, 2, \dots\}$ through which $X_{n_k} \to X$ almost surely (Chung [85], p.73).
Convergence in $p$-norm implies convergence in probability. Neither almost sure
convergence nor convergence in p-norm implies the other. 

For random vectors, convergence in probability, in norm, or almost surely is 
equivalent to the corresponding convergence of each component of the vector.

A sequence of estimators $\{\hat\theta_n, n \ge 1\}$ is consistent for a parameter $\theta$ if $\hat\theta_n$
converges to $\theta$ in probability. The sequence is strongly consistent if
convergence holds with probability 1.


Convergence in Distribution 


Random variables $\{X_n, n = 1, 2, \dots\}$ with distribution functions $F_n$ converge
in distribution to a random variable $X$ with distribution $F$ if

\[ F_n(x) \to F(x) \ \text{for every } x \in \mathbb{R} \text{ at which } F \text{ is continuous.} \tag{A.1} \]

The random variables $X, X_1, X_2, \dots$ need not be defined on a common
probability space. Convergence in distribution is denoted by the symbol "$\Rightarrow$", as
in

\[ X_n \Rightarrow X, \]

and is also called weak convergence. It is equivalent to the convergence of 
$E[f(X_n)]$ to $E[f(X)]$ for all bounded continuous functions $f : \mathbb{R} \to \mathbb{R}$. This
characterization is convenient in extending the definition of weak convergence 
to random elements of more general spaces; in particular, we take it as the 
definition of convergence in distribution for random vectors. 

Convergence in distribution of random variables $X_n$ to $X$ is also equivalent
to pointwise convergence of their characteristic functions:

\[ \lim_{n \to \infty} E\left[e^{itX_n}\right] = E\left[e^{itX}\right] \]

for all $t \in \mathbb{R}$, with $i = \sqrt{-1}$. More precisely, $X_n \Rightarrow X$ implies convergence
of the characteristic functions; and if the characteristic functions of the $X_n$
converge pointwise to a function continuous at zero, then this limit is the
characteristic function of a random variable to which the $X_n$ then converge
in distribution.

Convergence in distribution is implied by convergence in probability, hence 
also by almost sure convergence and by convergence in norm. If $X_n \Rightarrow c$ with
$c$ a constant, then the $X_n$ converge to $c$ in probability; this follows from (A.1)
when $F(x) = \mathbf{1}\{x \ge c\}$.

Suppose random variables $\{X_n, n \ge 1\}$ and $\{Y_n, n \ge 1\}$ are all defined on
a common probability space and that $X_n \Rightarrow X$ and $Y_n \Rightarrow c$, with $c$ a constant.
Then

\[ X_n + Y_n \Rightarrow X + c \quad \text{and} \quad X_n Y_n \Rightarrow Xc. \tag{A.2} \]


The first assertion in (A.2) is sometimes called Slutsky’s Theorem. 

The requirement that the limit of the $Y_n$ be constant is important. If
$X_n \Rightarrow X$ and $Y_n \Rightarrow Y$, it is not in general the case that $X_n + Y_n \Rightarrow X + Y$
or $X_n Y_n \Rightarrow XY$. These limits do hold under the stronger hypothesis that
$(X_n, Y_n) \Rightarrow (X, Y)$. Indeed, any bounded continuous function of $x + y$ or $xy$ can
be written as a bounded continuous function of $(x, y)$, and weak convergence
of vectors is defined by convergence of expectations of all bounded continuous
functions.

Suppose $\{N_n, n = 1, 2, \dots\}$ is a nondecreasing process taking positive
integer values and increasing to infinity with probability 1. Suppose $X_n \Rightarrow X$.
If $\{N_n, n = 1, 2, \dots\}$ and $\{X_n, n = 1, 2, \dots\}$ are independent sequences, then
$X_{N_n} \Rightarrow X$. This holds even without independence if $N_n/n \Rightarrow c$ for some
constant $c > 0$, a result sometimes called Anscombe's Theorem. We need this
result in (1.12). See also Theorem 7.3.2 of Chung [85]. 


Convergence of Moments 


Because the mapping $x \mapsto x^r$, $r > 0$, is unbounded, convergence in
distribution does not imply convergence of moments. Suppose $X_n \Rightarrow X$. Then a
necessary and sufficient condition for the convergence of $E[|X_n|^r]$ to $E[|X|^r]$ is
uniform integrability:

\[ \lim_{c \to \infty} \sup_{n \ge 1} E\left[|X_n|^r \mathbf{1}\{|X_n|^r > c\}\right] = 0. \tag{A.3} \]

By the dominated convergence theorem, a simple sufficient condition is the
existence of an integrable random variable $Y$ such that $|X_n|^r \le Y$ for all $n$.
Another sufficient condition is

\[ \sup_{n \ge 1} E\left[|X_n|^{r + \epsilon}\right] < \infty, \]

for some $\epsilon > 0$.


A.2 Central Limit Theorem and Confidence Intervals 


The elementary central limit theorem states the following: If $X_1, X_2, \dots$ are
independent and identically distributed with expectation $\mu$ and variance $\sigma^2$,
$0 < \sigma < \infty$, then the sample mean

\[ \bar X_n = \frac{1}{n} \sum_{i=1}^n X_i \]

satisfies

\[ \frac{\bar X_n - \mu}{\sigma/\sqrt{n}} \Rightarrow N(0,1), \tag{A.4} \]

with $N(0,1)$ denoting the standard normal distribution. This is proved by
showing that the characteristic function of the expression on the left converges
to the characteristic function of the standard normal distribution.

If $X_1, X_2, \dots$ are i.i.d. random vectors with mean vector $\mu$ and covariance
matrix $\Sigma$, then

\[ \sqrt{n}\,[\bar X_n - \mu] \Rightarrow N(0, \Sigma), \]

where $N(0, \Sigma)$ denotes the multivariate normal distribution with mean 0 and
covariance matrix $\Sigma$. This result can be deduced from (A.4) by considering
all linear combinations of the components of the vector $\bar X_n$.

By the definition of convergence in distribution, (A.4) means that for all
$x \in \mathbb{R}$,

\[ P\left( \frac{\bar X_n - \mu}{\sigma/\sqrt{n}} \le x \right) \to \Phi(x), \]

with $\Phi$ the standard cumulative normal distribution. From this it follows that
the probability that an interval of the form

\[ \left( \bar X_n - b\frac{\sigma}{\sqrt{n}},\ \bar X_n + a\frac{\sigma}{\sqrt{n}} \right), \]

$0 < a, b < \infty$, covers $\mu$ approaches $\Phi(b) - \Phi(-a)$ as $n \to \infty$. We can choose $a$
and $b$ so that this limiting probability is $1 - \delta$, for any $\delta > 0$. Among all choices
of $a$ and $b$ for which $\Phi(b) - \Phi(-a) = 1 - \delta$, the values minimizing the length
of the interval $(-a, b)$ are given by $a = b = z_{\delta/2}$, where $1 - \Phi(z_{\delta/2}) = \delta/2$.
The interval

\[ \bar X_n \pm z_{\delta/2} \frac{\sigma}{\sqrt{n}} \tag{A.5} \]

covers $\mu$ with probability approaching $1 - \delta$ as $n \to \infty$ and is in this sense an
asymptotically valid confidence interval for $\mu$.

Now let $s_n$ be any consistent estimator of $\sigma$, meaning that $s_n \Rightarrow \sigma$. Because
$\sigma > 0$, we may modify $s_n$ so that it is always positive without affecting
consistency, and then we have $\sigma/s_n \Rightarrow 1$. From (A.2) and (A.4) it follows that

\[ \frac{\bar X_n - \mu}{s_n/\sqrt{n}} \Rightarrow N(0,1), \]

and thus that

\[ \bar X_n \pm z_{\delta/2} \frac{s_n}{\sqrt{n}} \tag{A.6} \]

is also an asymptotically valid $1 - \delta$ confidence interval. Because $\sigma$ is typically
unknown, this interval is of more practical use than (A.5); often,

\[ s_n = \sqrt{\frac{1}{n-1} \sum_{i=1}^n (X_i - \bar X_n)^2}, \]

the sample standard deviation of $X_1, \dots, X_n$.

If the $X_i$ are normally distributed, then the ratio in (A.4) has the standard
normal distribution for all $n \ge 1$. It follows that in this case the interval (A.5)
covers $\mu$ with probability $1 - \delta$ for all $n \ge 1$. With $s_n$ the sample standard
deviation,

\[ \frac{\bar X_n - \mu}{s_n/\sqrt{n}} \sim t_{n-1} \]

for all $n \ge 2$; i.e., the ratio on the left has the t distribution with $n - 1$ degrees
of freedom. Accordingly, if we replace $z_{\delta/2}$ with $t_{n-1,\delta/2}$, the $1 - \delta/2$ quantile
of the $t_{n-1}$ distribution, the interval

\[ \bar X_n \pm t_{n-1,\delta/2} \frac{s_n}{\sqrt{n}} \]

covers $\mu$ with probability $1 - \delta$. Even if the $X_i$ are not normally distributed, this
produces a more conservative confidence interval because the t multiplier is
always larger than the corresponding z multiplier. But for even modest values
of $n$, the multipliers are nearly equal so we do not stress this distinction.
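
As a small illustration of (A.6), the following sketch computes the normal-approximation confidence interval from a list of simulation outputs. The function name is hypothetical, and the sample standard deviation uses the $n - 1$ divisor, as `statistics.stdev` does.

```python
import math
from statistics import NormalDist, mean, stdev

def normal_confidence_interval(samples, delta=0.05):
    """Asymptotic 1 - delta confidence interval (A.6) for the mean."""
    n = len(samples)
    x_bar = mean(samples)
    s_n = stdev(samples)                       # sample standard deviation
    z = NormalDist().inv_cdf(1.0 - delta / 2)  # z_{delta/2}
    half_width = z * s_n / math.sqrt(n)
    return x_bar - half_width, x_bar + half_width

print(normal_confidence_interval([1.0, 2.0, 3.0, 4.0, 5.0]))
```

With only five observations the t multiplier would be noticeably larger than $z_{\delta/2}$; the example is meant only to show the arithmetic.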

In addition to providing information about the precision of an estimator, 
a confidence interval is useful in sample-size determination. From (A.5) we 
find that the number of replications required to achieve a confidence interval
halfwidth of $\epsilon$ is

\[ n_\epsilon = \frac{z_{\delta/2}^2 \sigma^2}{\epsilon^2}. \tag{A.7} \]

If $\sigma$ is unknown, a two-stage procedure uses an initial set of replications to
estimate it and then uses this estimate in (A.7) to estimate the total sample 
size required. 
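
The two-stage procedure can be sketched as follows. The function names, the pilot size, and the rule of never returning fewer replications than the pilot run are illustrative assumptions rather than part of the text; `simulate` stands for any routine returning one replication.

```python
import math
import random
from statistics import NormalDist, stdev

def required_replications(sigma, eps, delta=0.05):
    """(A.7): n = z_{delta/2}^2 sigma^2 / eps^2, rounded up."""
    z = NormalDist().inv_cdf(1.0 - delta / 2)
    return math.ceil((z * sigma / eps) ** 2)

def two_stage_sample_size(simulate, n_pilot, eps, delta=0.05):
    """Estimate sigma from a pilot run, then apply (A.7)."""
    pilot = [simulate() for _ in range(n_pilot)]
    sigma_hat = stdev(pilot)                   # first-stage estimate of sigma
    return max(n_pilot, required_replications(sigma_hat, eps, delta))

rng = random.Random(7)
print(required_replications(1.0, 0.01))
print(two_stage_sample_size(lambda: rng.gauss(0.0, 2.0), 1000, 0.1))
```

Halving the target halfwidth $\epsilon$ quadruples the required number of replications, which is the usual $1/\sqrt{n}$ convergence rate seen from the other direction.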

Similar error bounds and procedures can be derived from Chebyshev's
inequality, which states that

\[ P\left( |\bar X_n - \mu| \ge \frac{\sigma}{\sqrt{\delta n}} \right) \le \delta, \]

for all $\delta > 0$. This is valid for all $n \ge 1$, but is more conservative than the
normal approximation.

In (A.4) we have presented only the most elementary form of the central 
limit theorem. The sample mean and other estimators are asymptotically nor- 
mal under more general conditions. The elementary result suffices whenever 
we simulate independent and identically distributed replications. But more 
general results are needed to handle other settings that arise in simulation; 
we mention two. 


o Variance reduction techniques often introduce dependence across replica- 
tions. In some cases (e.g., control variates), the dependence becomes negli- 
gible as the number of replications increases, but in others (e.g., stratified 
sampling, Latin hypercube sampling) it does not. Simulating batches and 
allowing dependence within batches but not across batches reduces the 
problem to one of independent replications; but even without this limita- 
tion a more general central limit theorem often applies. See, for example, 
the discussion of output analysis in Section 4.4.
o We are often interested in how the error in an estimator changes as some
parameter of a simulation changes along with an increase in the number
of replications. This applies, for example, when we discretize a model with
time step $h$. Rather than fixing $h$ and letting the number of replications $n$
increase, we may want to analyze the convergence of an estimator as both
$h \to 0$ and $n \to \infty$. Because changing $h$ changes the distribution from which
the replications are sampled, this setting requires a central limit theorem for
an array (as opposed to a sequence) of random variables. The Lindeberg-Feller
central limit theorem (as in, e.g., Chung [85], Section 7.2) is the key
result of this type.


B 
Appendix: Results from Stochastic Calculus 


This appendix records some background results on stochastic integrals, sto- 
chastic differential equations, martingales, and measure transformations. For 
more comprehensive treatments of these topics see, e.g., Hunt and Kennedy 
[191], Karatzas and Shreve [207], Øksendal [284], Protter [300], and Revuz 
and Yor [306]. 


B.1 Itô’s Formula 


Our starting point is a probability space $(\Omega, \mathcal{F}, P)$ with a filtration $\{\mathcal{F}_t, t \ge 0\}$,
meaning a family of sub-$\sigma$-algebras of $\mathcal{F}$ with $\mathcal{F}_s \subseteq \mathcal{F}_t$ whenever $s \le t$.
Depending on the context, $t$ may range over all of $[0, \infty)$ or be restricted to
an interval $[0, T]$ for some fixed finite $T$. Some results require that the filtration
satisfy the "usual conditions," so we assume these hold: $\mathcal{F}_0$ contains all subsets
of sets in $\mathcal{F}$ having $P$-probability 0, and each $\mathcal{F}_s$ is the intersection of all $\mathcal{F}_t$
with $t > s$. A stochastic process $\{X(t), t \ge 0\}$ is adapted to the filtration
if $X(t) \in \mathcal{F}_t$ for all $t \ge 0$, meaning that $X(t)$ is $\mathcal{F}_t$-measurable. We assume
that on this filtered probability space is defined a $k$-dimensional Brownian
motion $W = (W_1, \dots, W_k)^\top$ with respect to the filtration. In particular, $W$
is adapted and if $t > s$ then $W(t) - W(s)$ is independent of $\mathcal{F}_s$. We denote
by $\{\mathcal{F}_t^W, t \ge 0\}$ the filtration generated by the Brownian motion with $\mathcal{F}_0^W$
augmented so that the usual conditions are satisfied.

For any vector or matrix $a$, let $\|a\|$ denote the square root of the sum of
squared entries of $a$. If $\{\gamma(t), 0 \le t \le T\}$ is an $\mathbb{R}^k$-valued adapted process for which

$$P\left(\int_0^T \|\gamma(t)\|^2\,dt < \infty\right) = 1,$$

for some fixed $T$, then the Itô integral

$$\int_0^t \gamma(u)^\top\,dW(u) \tag{B.1}$$

is well-defined for all $t$ in $[0, T]$. This is not a routine extension of the ordinary
integral because the paths of $W$ have infinite variation. See, e.g., Karatzas
and Shreve [207] or Øksendal [284] for a development of the Itô integral. We
can replace the integrand $\gamma$ with an $\mathbb{R}^{d \times k}$-valued process $b$ by applying (B.1)
to each row of $b$; this produces a $d$-dimensional vector, each component of
which is a stochastic integral.
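An $\mathbb{R}^d$-valued process $\{X(t), 0 \le t \le T\}$ is an Itô process if it can be
represented as

$$X(t) = X(0) + \int_0^t a(u)\,du + \int_0^t b(u)\,dW(u), \quad 0 \le t \le T, \tag{B.2}$$

where $X(0)$ is $\mathcal{F}_0$-measurable, $a$ is an $\mathbb{R}^d$-valued adapted process satisfying

$$P\left(\int_0^T \|a(t)\|\,dt < \infty\right) = 1, \tag{B.3}$$

and $b$ is an $\mathbb{R}^{d \times k}$-valued adapted process satisfying

$$P\left(\int_0^T \|b(t)\|^2\,dt < \infty\right) = 1. \tag{B.4}$$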


The notation 
dX(t) = a(t) dt + b(t) dW(t) (B.5) 


is shorthand for (B.2). 


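Theorem B.1.1 (Itô’s Formula). Let $X$ be an $\mathbb{R}^d$-valued Itô process as in
(B.2) and let $f : [0, T] \times \mathbb{R}^d \to \mathbb{R}$ be continuously differentiable in its first
argument and twice continuously differentiable in its second argument. Let
$\Sigma(t) = b(t)b(t)^\top$. Then $Y(t) = f(t, X(t))$ is an Itô process with

$$\begin{aligned}
dY(t) &= \frac{\partial f}{\partial t}(t, X(t))\,dt + \sum_{i=1}^d \frac{\partial f}{\partial x_i}(t, X(t))\,dX_i(t) + \frac{1}{2}\sum_{i,j=1}^d \frac{\partial^2 f}{\partial x_i \partial x_j}(t, X(t))\,\Sigma_{ij}(t)\,dt \\
&= \left(\frac{\partial f}{\partial t}(t, X(t)) + \sum_{i=1}^d \frac{\partial f}{\partial x_i}(t, X(t))\,a_i(t) + \frac{1}{2}\sum_{i,j=1}^d \frac{\partial^2 f}{\partial x_i \partial x_j}(t, X(t))\,\Sigma_{ij}(t)\right) dt \\
&\quad + \sum_{i=1}^d \frac{\partial f}{\partial x_i}(t, X(t))\,b_{i\cdot}(t)\,dW(t)
\end{aligned} \tag{B.6}$$

with $b_{i\cdot}$ the $i$th row of $b$; i.e.,

$$\begin{aligned}
Y(t) = f(0, X(0)) &+ \int_0^t \left(\frac{\partial f}{\partial t}(u, X(u)) + \sum_{i=1}^d \frac{\partial f}{\partial x_i}(u, X(u))\,a_i(u) + \frac{1}{2}\sum_{i,j=1}^d \frac{\partial^2 f}{\partial x_i \partial x_j}(u, X(u))\,\Sigma_{ij}(u)\right) du \\
&+ \int_0^t \sum_{i=1}^d \frac{\partial f}{\partial x_i}(u, X(u))\,b_{i\cdot}(u)\,dW(u).
\end{aligned} \tag{B.7}$$

This is the chain rule of stochastic calculus. It differs from the corresponding
result in ordinary calculus through the appearance of second derivatives
of $f$ in the $dt$ term.

If $f$ has no explicit dependence on $t$ (so that $Y(t) = f(X(t))$), equation
(B.6) simplifies to

$$dY(t) = \left(\sum_{i=1}^d \frac{\partial f}{\partial x_i}(X(t))\,a_i(t) + \frac{1}{2}\sum_{i,j=1}^d \frac{\partial^2 f}{\partial x_i \partial x_j}(X(t))\,\Sigma_{ij}(t)\right) dt + \sum_{i=1}^d \frac{\partial f}{\partial x_i}(X(t))\,b_{i\cdot}(t)\,dW(t). \tag{B.8}$$

If $X$, $a$, and $b$ are scalar processes, it becomes

$$dY(t) = \left(f'(X(t))\,a(t) + \tfrac{1}{2}f''(X(t))\,b^2(t)\right) dt + f'(X(t))\,b(t)\,dW(t). \tag{B.9}$$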
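
As a numerical sanity check on the scalar form of Itô's formula, the following sketch takes $X = W$ (so $a = 0$, $b = 1$) and $f(x) = x^2$, for which the formula gives $d(W^2) = dt + 2W\,dW$. The function name `ito_check`, the step count, and the seed are illustrative assumptions and do not come from the text; the stochastic integral is approximated by a left-endpoint sum on a single simulated path.

```python
import math
import random


def ito_check(T=1.0, n_steps=100_000, seed=1):
    """Compare W(T)^2 with T + 2 * sum_i W(t_i) dW_i on one simulated path."""
    rng = random.Random(seed)
    dt = T / n_steps
    sqrt_dt = math.sqrt(dt)
    w = 0.0
    stochastic_integral = 0.0  # left-endpoint (Ito) sum approximating int 2 W dW
    for _ in range(n_steps):
        dw = sqrt_dt * rng.gauss(0.0, 1.0)
        stochastic_integral += 2.0 * w * dw  # integrand evaluated before the increment
        w += dw
    lhs = w * w                    # f(W(T))
    rhs = T + stochastic_integral  # Ito's formula with f'(x) = 2x, f''(x) = 2
    return lhs, rhs


if __name__ == "__main__":
    lhs, rhs = ito_check()
    print(lhs, rhs)
```

The two returned values differ only by the gap between the sum of squared increments and $T$, which shrinks as the number of steps grows. Dropping the $T$ term, as the ordinary chain rule would suggest, leaves a discrepancy that does not vanish.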


By applying Theorem B.1.1 to the mapping $(x, y) \mapsto xy$ we obtain the
following useful special case:


Corollary B.1.1 (Product Rule). Let $(X, Y)$ be an Itô process on $\mathbb{R}^2$,

$$d\begin{pmatrix} X(t) \\ Y(t) \end{pmatrix} = \begin{pmatrix} a_X(t) \\ a_Y(t) \end{pmatrix} dt + \begin{pmatrix} b_X(t)^\top \\ b_Y(t)^\top \end{pmatrix} dW(t),$$

with $a_X$, $a_Y$ scalar valued, $b_X$, $b_Y$ taking values in $\mathbb{R}^k$, and $W$ a $k$-dimensional
Brownian motion. Then

$$\begin{aligned}
d(X(t)Y(t)) &= X(t)\,dY(t) + Y(t)\,dX(t) + b_X(t)^\top b_Y(t)\,dt \\
&= \left[X(t)a_Y(t) + Y(t)a_X(t) + b_X(t)^\top b_Y(t)\right] dt + \left[X(t)b_Y(t) + Y(t)b_X(t)\right]^\top dW(t).
\end{aligned} \tag{B.10}$$


This result can be interpreted as an integration-by-parts formula for Itô
calculus because (after rearranging terms) it relates $X\,dY$ to $Y\,dX$.


B.2 Stochastic Differential Equations


Most models used in financial engineering can be described through a stochastic
differential equation (SDE) of the form

$$dX(t) = a(X(t), t)\,dt + b(X(t), t)\,dW(t), \quad X(0) = X_0, \tag{B.11}$$

with $W$ a $k$-dimensional Brownian motion, $a$ mapping $\mathbb{R}^d \times [0, \infty)$ into $\mathbb{R}^d$, $b$
mapping $\mathbb{R}^d \times [0, \infty)$ into $\mathbb{R}^{d \times k}$, and $X_0$ a random $d$-vector independent of $W$.
In what sense do the functions $a$ and $b$ and the initial condition $X_0$ determine
a process $X$?

A strong solution to (B.11) on an interval $[0, T]$ is an Itô process $\{X(t), 0 \le t \le T\}$
for which $P(X(0) = X_0) = 1$ and

$$X(t) = X(0) + \int_0^t a(X(u), u)\,du + \int_0^t b(X(u), u)\,dW(u), \quad 0 \le t \le T.$$

The requirement that $X$ be an Itô process imposes conditions (B.3) and (B.4)
on the processes $\{a(X(t), t), 0 \le t \le T\}$ and $\{b(X(t), t), 0 \le t \le T\}$. We now
state the main result on strong solutions to SDEs.


Theorem B.2.1 (Existence and Uniqueness of Solutions). Suppose $E[\|X_0\|^2]$
is finite and that there is a constant $K$ for which the following conditions are
satisfied:

(i) $\|a(x, t) - a(y, t)\| + \|b(x, t) - b(y, t)\| \le K\|x - y\|$ (Lipschitz condition)
(ii) $\|a(x, t)\| + \|b(x, t)\| \le K(1 + \|x\|)$ (Linear growth condition)

for all $t \in [0, T]$ and all $x, y \in \mathbb{R}^d$. Then the SDE (B.11) admits a strong
solution $X$. This solution is unique in the sense that if $\tilde{X}$ is also a solution,
then $P(X(t) = \tilde{X}(t)\ \forall t \in [0, T]) = 1$. For all $t \in [0, T]$, the solution satisfies
$E[\|X(t)\|^2] < \infty$.

Proofs of this result can be found in, e.g., Hunt and Kennedy [191], Karatzas
and Shreve [207], and Øksendal [284].
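
To make the objects in the SDE concrete, here is a minimal sketch of an Euler approximation to a scalar SDE $dX(t) = a(X(t), t)\,dt + b(X(t), t)\,dW(t)$. It is only an illustration under stated assumptions: the helper names `euler_path` and `euler_mean`, the step and path counts, and the example coefficients $a(x, t) = -x$, $b(x, t) = 0.3$ (which satisfy the Lipschitz and linear growth conditions) are all hypothetical choices, and the scheme produces an approximation to the strong solution, not the solution itself.

```python
import math
import random


def euler_path(a, b, x0, T, n_steps, rng):
    """Euler recursion X_{i+1} = X_i + a(X_i, t_i) h + b(X_i, t_i) sqrt(h) Z_i; returns X at time T."""
    h = T / n_steps
    x, t = x0, 0.0
    for _ in range(n_steps):
        z = rng.gauss(0.0, 1.0)  # standard normal increment, scaled by sqrt(h) below
        x += a(x, t) * h + b(x, t) * math.sqrt(h) * z
        t += h
    return x


def euler_mean(a, b, x0, T, n_steps=50, n_paths=10_000, seed=7):
    """Average of the Euler terminal values over independent simulated paths."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_paths):
        total += euler_path(a, b, x0, T, n_steps, rng)
    return total / n_paths


if __name__ == "__main__":
    # Hypothetical example: dX = -X dt + 0.3 dW, X(0) = 1, for which E[X(1)] = exp(-1).
    estimate = euler_mean(lambda x, t: -x, lambda x, t: 0.3, 1.0, 1.0)
    print(estimate, math.exp(-1.0))
```

For this linear example the exact mean is known, so the estimate can be compared with $e^{-1}$; the small remaining gap combines discretization bias and Monte Carlo error.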

In the definition of a strong solution, the probability space and driving 
Brownian motion W are specified as part of the SDE together with the func- 
tions a and b. If we ask for just a weak solution, we are free to define a different 
probability space supporting its own Brownian motion on which (B.11) holds. 
For modeling purposes we are generally only concerned about the law of a 
process, so there is little reason to insist on a particular probability space. 
The most relevant issue is whether $a$, $b$, and the distribution of $X_0$ uniquely
determine the law of any weak solution to (B.11). The strong uniqueness implied
by Theorem B.2.1 is more than enough to ensure this, but the simplicity
of the conditions on $a$ and $b$ makes this a particularly convenient result.

The square-root function is not Lipschitz continuous, so Theorem B.2.1 
does not apply to the square-root diffusion of Section 3.4. A result covering 
that model appears in Karatzas and Shreve [207], p.291. See also Krylov [216] 
for existence and uniqueness results under conditions weaker than those in
Theorem B.2.1.


Markov Property


The form of the SDE (B.11) suggests that X(t) provides a complete descrip- 
tion of the state of the system modeled by X: the dynamics of X in (B.11) do 
not depend on the past evolution of X except through its current value X(t). 
This is the intuitive content of the Markov property, which is made precise 
through the requirement that 


$$E[f(X(t)) \mid \mathcal{F}_s] = E[f(X(t)) \mid X(s)] \tag{B.12}$$


for all $0 \le s \le t \le T$ and all bounded Borel functions $f : \mathbb{R}^d \to \mathbb{R}$. The process
$X$ is a strong Markov process if (B.12) continues to hold with $s$ replaced by any
stopping time (with respect to $\{\mathcal{F}_t\}$) taking values in $[0, T]$. This property is
confirmed by the following result, proved in, for example, Hunt and Kennedy 
[191] and Øksendal [284]: 


Theorem B.2.2 (Markov property). Under the conditions of Theorem B.2.1, 
the solution X of (B.11) is a strong Markov process. 


The Gaussian Case 


An $\mathbb{R}^d$-valued process $\{\xi(t), t \in [0, T]\}$ is called Gaussian if for all $n = 1, 2, \ldots$
and all $t_1, \ldots, t_n \in [0, T]$, the vector formed by concatenating $\xi(t_1), \ldots, \xi(t_n)$
has a multivariate normal distribution. Brownian motion is an example of a 
Gaussian process. A Gaussian process need not be Markovian and the solution 
to an SDE need not be Gaussian. But the next result tells us that in the case 
of a linear SDE the solution is indeed Gaussian: 


Theorem B.2.3 (Linear SDE). Let $A$, $c$, and $D$ be bounded measurable functions
on $[0, T]$ taking values in $\mathbb{R}^{d \times d}$, $\mathbb{R}^d$, and $\mathbb{R}^{d \times k}$, respectively. Let $X_0$ be
normally distributed on $\mathbb{R}^d$ independent of the $k$-dimensional Brownian motion
$W$. Then the solution of the SDE


$$dX(t) = (A(t)X(t) + c(t))\,dt + D(t)\,dW(t), \quad X(0) = X_0 \tag{B.13}$$


is a Gaussian process. 


For a proof see Karatzas and Shreve [207], Problem 5.6.2 (solution in- 
cluded). 

The law of a Gaussian process is completely specified by its first- and 
second-order moments. These can be given fairly explicitly in the case of 
(B.13). We consider only the case of constant A(t) = A; for the general case 
see, e.g., Karatzas and Shreve [207]. 


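Proposition B.2.1 If $A(t) = A$ in Theorem B.2.3, then

$$X(t) = e^{At}X(0) + \int_0^t e^{A(t-u)}c(u)\,du + \int_0^t e^{A(t-u)}D(u)\,dW(u). \tag{B.14}$$

The mean $m(t) = E[X(t)]$ is given by

$$m(t) = e^{At}m(0) + \int_0^t e^{A(t-u)}c(u)\,du$$

and the covariance by

$$\mathrm{Cov}[X(s), X(t)] = e^{As}\,\mathrm{Cov}[X(0)]\,e^{A^\top t} + \int_0^{\min(s,t)} e^{A(s-u)}D(u)D(u)^\top e^{A^\top(t-u)}\,du.$$


That the process in (B.14) satisfies (B.13) can be verified using Itô’s formula.
The expressions for the moments of $X$ then follow from simple rules for
calculating means and covariances of stochastic integrals with deterministic
integrands, which we now make more explicit. If $\sigma : [0, T] \to \mathbb{R}^k$ satisfies

$$\int_0^T \|\sigma(u)\|^2\,du < \infty, \tag{B.15}$$

then

$$E\left[\int_0^t \sigma(u)^\top\,dW(u)\right] = 0, \tag{B.16}$$

and

$$\mathrm{Var}\left[\int_0^t \sigma(u)^\top\,dW(u)\right] = \int_0^t \|\sigma(u)\|^2\,du. \tag{B.17}$$

If $\sigma_1, \sigma_2$ both satisfy (B.15), then for any $s, t \in [0, T]$

$$\mathrm{Cov}\left[\int_0^s \sigma_1(u)^\top\,dW(u),\ \int_0^t \sigma_2(u)^\top\,dW(u)\right] = \int_0^{\min(s,t)} \sigma_1(u)^\top \sigma_2(u)\,du.$$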
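
The covariance rule for stochastic integrals with deterministic integrands can be checked numerically in the simplest setting. The sketch below assumes a scalar linear SDE with constant $A \ne 0$, constant $D$, and a deterministic initial value, so that the covariance of $X(s)$ and $X(t)$ reduces to $\int_0^{\min(s,t)} e^{A(s-u)} D^2 e^{A(t-u)}\,du$. The function names and the midpoint quadrature are illustrative assumptions; the closed form is obtained by integrating the exponential directly.

```python
import math


def ou_cov_closed_form(A, D, s, t):
    """Closed form of int_0^min(s,t) e^{A(s-u)} D^2 e^{A(t-u)} du for constant A != 0 and D."""
    m = min(s, t)
    return D * D * math.exp(A * (s + t)) * (1.0 - math.exp(-2.0 * A * m)) / (2.0 * A)


def ou_cov_quadrature(A, D, s, t, n=10_000):
    """Midpoint-rule evaluation of the same integral, for comparison."""
    m = min(s, t)
    h = m / n
    total = 0.0
    for i in range(n):
        u = (i + 0.5) * h  # midpoint of the i-th subinterval
        total += math.exp(A * (s - u)) * D * D * math.exp(A * (t - u))
    return total * h


if __name__ == "__main__":
    print(ou_cov_closed_form(-0.5, 0.2, 1.0, 2.0), ou_cov_quadrature(-0.5, 0.2, 1.0, 2.0))
```

Setting $s = t$ recovers the familiar variance $D^2 (e^{2At} - 1)/(2A)$ of this scalar Gaussian process.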


B.3 Martingales 


This section summarizes some results relating stochastic integrals and martingales.

A real-valued adapted process $\{X(t), t \ge 0\}$ is a martingale if

(i) $E[|X(t)|] < \infty$ for all $t \ge 0$;
(ii) $E[X(t) \mid \mathcal{F}_s] = X(s)$ for all $0 \le s \le t < \infty$.


Define a martingale on [0, T] by restricting t and s to this interval. Throughout 
most of this book, we implicitly assume the integrability property (i) in calling 
a process a martingale. 

The process $X$ is a local martingale if there exists a sequence of stopping
times $\{\tau_n, n = 1, 2, \ldots\}$ with $\tau_n \uparrow \infty$ for which each process $X_n(t) = X(t \wedge \tau_n)$
is a martingale. All martingales are local martingales, but the converse does
not hold. If, however, $X$ is a local martingale and

$$E\left[\sup_{0 \le t \le T} |X(t)|\right] < \infty,$$

then X is in fact a martingale on [0, T]. This follows from the dominated 
convergence theorem, as demonstrated in Protter [300], Theorem I.47. 

It is common in applied work to assume that the solution to an SDE with 
no drift term (a in (B.11)) is a martingale and, more generally, that any 
process of the form 


$$X(t) = X(0) + \int_0^t \gamma(u)^\top\,dW(u), \quad X(0) \in \mathcal{F}_0, \tag{B.18}$$


is a martingale. We make this assumption in several places throughout this 
book, usually to derive implications for asset price dynamics from the absence 
of arbitrage. But the process in (B.18) is not automatically a martingale with- 
out additional hypotheses. The following result gives some indication of what 
we are leaving out in assuming the martingale property holds. 


Theorem B.3.1 (Stochastic integrals as martingales). Let $X$ be as in (B.18)
with

$$\int_0^t \|\gamma(u)\|^2\,du < \infty, \quad \text{a.s., for all } t. \tag{B.19}$$

Then (i) $X$ is a local martingale. (ii) If $X(t) \ge 0$, a.s., for all $t$, then

$$E[X(t) \mid \mathcal{F}_s] \le X(s), \quad 0 \le s \le t;$$

i.e., $X$ is a supermartingale. If $E[X(t)]$ is constant then $X$ is a martingale. (iii)
If $E[X(0)^2] < \infty$ and if for all $t \ge 0$,

$$E\left[\int_0^t \|\gamma(u)\|^2\,du\right] < \infty, \tag{B.20}$$

then $X$ is a martingale and

$$E[X(t)^2] = E[X(0)^2] + E\left[\int_0^t \|\gamma(u)\|^2\,du\right].$$


From (i) we see that a “driftless” process is a local martingale but not 
necessarily a martingale. Property (ii) follows from the fact (Revuz and Yor 
[306], p.123) that every nonnegative local martingale is a supermartingale and 
any supermartingale with constant expectation is a martingale. Property (iii) 
states that as long as we restrict ourselves to integrands satisfying (B.20), 
stochastic integrals are indeed martingales and in fact square-integrable
martingales. This is a special case of general results on stochastic integration with
respect to martingales; see, for example, Proposition 3.2.10 of Karatzas and Shreve
[207].

We frequently work with processes of the form

$$Y(t) = Y(0)\exp\left(-\frac{1}{2}\int_0^t \|\gamma(u)\|^2\,du + \int_0^t \gamma(u)^\top\,dW(u)\right). \tag{B.21}$$

If (B.19) holds, then

$$X(t) = -\frac{1}{2}\int_0^t \|\gamma(u)\|^2\,du + \int_0^t \gamma(u)^\top\,dW(u) \tag{B.22}$$

is an Itô process and hence (by Theorem B.1.1) $Y$ is too. An application of
Itô’s formula shows that $Y$ satisfies

$$dY(t) = Y(t)\gamma(t)^\top\,dW(t),$$

so $Y$ is at least a local martingale. If $Y(0)$ is nonnegative and if $E[Y(t)]$
is constant, then we see from part (ii) of Theorem B.3.1 that $Y$ is in fact a
martingale (an exponential martingale). In the special case that $\gamma$ is deterministic
and bounded on finite intervals, $X$ is a Gaussian process; with $Y(0) = 1$
we have

$$\begin{aligned}
E[Y(t)] &= E[\exp(X(t))] \\
&= \exp\left(E[X(t)] + \tfrac{1}{2}\mathrm{Var}[X(t)]\right) \\
&= \exp\left(-\frac{1}{2}\int_0^t \|\gamma(u)\|^2\,du + \frac{1}{2}\int_0^t \|\gamma(u)\|^2\,du\right) = 1,
\end{aligned}$$

using (B.16) and (B.17). This verifies that $Y$ is a martingale.
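
The unit-mean property of the exponential martingale can be illustrated by simulation. The sketch below assumes the simplest case of a constant scalar integrand, so that $Y(t) = \exp(-\tfrac{1}{2}\gamma^2 t + \gamma W(t))$ with $Y(0) = 1$ and $W(t)$ normal with variance $t$. The function name, parameter values, and sample size are illustrative assumptions.

```python
import math
import random


def exp_martingale_mean(gamma=0.5, t=1.0, n_paths=100_000, seed=11):
    """Monte Carlo estimate of E[Y(t)] for Y(t) = exp(-gamma^2 t / 2 + gamma W(t)), Y(0) = 1."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_paths):
        w_t = math.sqrt(t) * rng.gauss(0.0, 1.0)  # W(t) ~ N(0, t)
        total += math.exp(-0.5 * gamma * gamma * t + gamma * w_t)
    return total / n_paths


if __name__ == "__main__":
    print(exp_martingale_mean())
```

The estimate should be close to 1 for any horizon $t$, in line with the calculation above; omitting the $-\tfrac{1}{2}\gamma^2 t$ correction would instead give a mean of $e^{\gamma^2 t/2}$.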

Theorem B.3.1 states that under appropriate additional conditions, stochastic
integrals are martingales. The next result may be paraphrased as stating
that if Brownian motion is the only source of uncertainty, then all martingales
are stochastic integrals. To make this precise, let $\mathcal{F}_t^W$ be the $\sigma$-algebra
generated by $\{W(u), 0 \le u \le t\}$ augmented to include all subsets of null sets.
We now specialize to the filtration $\{\mathcal{F}_t^W\}$.


Theorem B.3.2 (Martingale representation theorem). If $X$ is a local martingale
with respect to $\{\mathcal{F}_t^W\}$, then there exists a process $\gamma$ such that (B.18)
holds. If $X$ is a square-integrable martingale, then $\gamma$ satisfies (B.20).


The second part is proved on pp.182-184 of Karatzas and Shreve [207]; a 
proof of the first part is provided in Hunt and Kennedy [191], p.113. 

A simple consequence of this result is that any integrable random variable
$\xi \in \mathcal{F}_t^W$ has a representation of the form

$$\xi = E[\xi] + \int_0^t \gamma(u)^\top\,dW(u). \tag{B.23}$$

This follows by applying Theorem B.3.2 to the martingale $X(u) = E[\xi \mid \mathcal{F}_u^W]$, $0 \le u \le t$.
Equation (B.23) says that we can synthesize the “payoff” $\xi - E[\xi]$ by “trading”
in the underlying Brownian motion $W$.

A further consequence of Theorem B.3.2 is that any strictly positive local
martingale (with respect to $\{\mathcal{F}_t^W\}$) has a representation of the form (B.21).
More precisely, suppose $Y$ is strictly positive and $\tilde{Y}(t) = Y(t)/Y(0)$ is a local
martingale. From Theorem B.3.2, we get a representation of the form

$$\tilde{Y}(t) = 1 + \int_0^t \pi(u)^\top\,dW(u).$$

Because $\tilde{Y}$ is strictly positive, we can define $X(t) = \log \tilde{Y}(t)$ and Itô’s formula
shows that $X$ satisfies (B.22) with $\gamma = \pi/\tilde{Y}$. But $\tilde{Y}(t) = \exp(X(t))$ and
$Y(t) = Y(0)\exp(X(t))$, so $Y$ has the representation in (B.21).


B.4 Change of Measure


Let $X$ be a nonnegative random variable on $(\Omega, \mathcal{F}, P)$ with $E[X] = 1$. Define
$Q : \mathcal{F} \to [0, 1]$ by setting

$$Q(A) = E[\mathbf{1}_A X] = \int_A X(\omega)\,dP(\omega), \quad A \in \mathcal{F}, \tag{B.24}$$

with $\mathbf{1}_A$ the indicator function of the set $A$. It is easy to verify that the set
function $Q$ is a probability measure on $(\Omega, \mathcal{F})$. It is absolutely continuous with
respect to $P$, meaning that

$$Q(A) > 0 \Rightarrow P(A) > 0$$

for every $A \in \mathcal{F}$. The Radon-Nikodym Theorem states that all such measures
arise in this way: if $P$ and $Q$ are probability measures on $(\Omega, \mathcal{F})$ and if $Q$ is
absolutely continuous with respect to $P$, then there exists a random variable
$X$ such that (B.24) holds. Moreover, $X$ is unique in the sense that if (B.24)
holds for all $A \in \mathcal{F}$ for some other random variable $X'$, then $P(X = X') = 1$.
Because $Q$ is a probability, we must have $P(X \ge 0) = 1$ and

$$\int_\Omega X(\omega)\,dP(\omega) = 1.$$

The random variable $X$ is commonly written as $dQ/dP$ and called the Radon-Nikodym
derivative or likelihood ratio of $Q$ with respect to $P$.

If $P$ is also absolutely continuous with respect to $Q$, then $P$ and $Q$ are
equivalent. Equivalent measures agree about which events have probability
zero. If $P$ and $Q$ are equivalent, then


$$\frac{dP}{dQ} = \left(\frac{dQ}{dP}\right)^{-1}.$$

To illustrate these ideas, suppose the random variable $Z$ on $(\Omega, \mathcal{F})$ has a
density $g$ under probability measure $P$; i.e.,

$$P(Z \le z) = \int_{-\infty}^z g(x)\,dx$$

for all $z \in \mathbb{R}$. Let $f$ be another probability density on $\mathbb{R}$ with the property
that $g(z) = 0 \Rightarrow f(z) = 0$ and define a new probability measure $Q$ on $(\Omega, \mathcal{F})$
by setting

$$Q(A) = E_P\left[\mathbf{1}_A \frac{f(Z)}{g(Z)}\right];$$

we have subscripted the expectation to emphasize that it is taken with respect
to $P$. Interpret $f(z)/g(z)$ to be 1 whenever both numerator and denominator
are 0. Clearly, $P(f(Z)/g(Z) \ge 0) = 1$ and

$$\int_\Omega \frac{f(Z(\omega))}{g(Z(\omega))}\,dP(\omega) = \int_{-\infty}^\infty \frac{f(x)}{g(x)}\,g(x)\,dx = 1.$$

Under the new measure, the event $\{Z \le z\}$ has probability

$$\begin{aligned}
Q(Z \le z) &= E_P\left[\mathbf{1}\{Z \le z\}\frac{f(Z)}{g(Z)}\right] \\
&= \int_{-\infty}^z \frac{f(x)}{g(x)}\,g(x)\,dx \\
&= \int_{-\infty}^z f(x)\,dx.
\end{aligned}$$

Thus, under $Q$, the random variable $Z$ has density $f$.

As a special case, let $g$ be the standard normal density and let $f$ be the
normal density with mean $\mu$ and variance 1. Then $f(x)/g(x) = \exp(\mu x - \mu^2/2)$.
If $Z$ has the standard normal distribution under $P$ and if we define $Q$ by setting

$$\frac{dQ}{dP} = e^{-\mu^2/2 + \mu Z},$$

then $Z$ has mean $\mu$ under $Q$ and $Z - \mu$ has the standard normal distribution
under $Q$.
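
The normal example lends itself to a quick numerical check. In the sketch below, $Z$ is sampled from the standard normal distribution (its law under $P$) and each draw is weighted by the likelihood ratio $\exp(\mu Z - \mu^2/2)$; the weighted average of $Z$ then estimates its mean under $Q$, which should be $\mu$. The function name, the value of $\mu$, and the sample size are illustrative assumptions.

```python
import math
import random


def mean_under_q(mu=0.5, n=100_000, seed=3):
    """Estimate E_Q[Z] from draws under P, weighting each by dQ/dP = exp(mu Z - mu^2 / 2)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)                     # Z ~ N(0, 1) under P
        weight = math.exp(mu * z - 0.5 * mu * mu)   # likelihood ratio evaluated at the draw
        total += weight * z
    return total / n


if __name__ == "__main__":
    print(mean_under_q())
```

The same weighting applied to the constant 1 estimates $E_P[dQ/dP] = 1$, which is a useful sanity check when experimenting with other densities.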
A similar calculation shows that if, under some probability measure $P_n$,
the random variables $Z_1, \ldots, Z_n$ are independent with densities $g_1, \ldots, g_n$,
and if we define $Q_n$ by setting

$$\frac{dQ_n}{dP_n} = \prod_{j=1}^n \frac{f_j(Z_j)}{g_j(Z_j)},$$

then under $Q_n$, the $Z_j$ are independent with densities $f_j$. In particular, if the
$Z_j$ are independent standard normals under $P_n$ and we set

$$\frac{dQ_n}{dP_n} = \exp\left(-\frac{1}{2}\sum_{j=1}^n \mu_j^2 + \sum_{j=1}^n \mu_j Z_j\right), \tag{B.25}$$

then

$$Z_1 - \mu_1,\ Z_2 - \mu_2,\ \ldots,\ Z_n - \mu_n \tag{B.26}$$

are independent standard normals under $Q_n$.

Consider, now, what happens in this example as $n$ becomes large. For
each $n$, the support of the multivariate normal vector $(Z_1, \ldots, Z_n)$ is all of
$\mathbb{R}^n$ regardless of its mean and the measures $P_n$ and $Q_n$ are equivalent by
construction. Suppose that under $P$ the random variables $Z_1, Z_2, \ldots$ are independent
standard normals, and suppose there is another measure $Q$ on $(\Omega, \mathcal{F})$
such that the $Z_i$ are independent normals with mean $\mu$ and variance 1 under
$Q$. Then

$$P\left(\lim_{n \to \infty} \frac{1}{n}\sum_{i=1}^n Z_i = 0\right) = 1,$$

whereas if $\mu \ne 0$ this event has $Q$-probability zero. Not only do $P$ and $Q$
fail to be equivalent, they live on entirely different sets. The mutual absolute
continuity that holds for each $n$ through (B.25) breaks down in the limit as
$n \to \infty$.

Our main interest in measure transformations lies in changes of measure 
that have the effect of adding a drift to a Brownian motion. This may be 
viewed as an extension of (B.25) and (B.26). In discussing measure trans- 
formations for continuous-time processes we restrict ourselves to finite time 
intervals, just as the transformation in (B.25) is feasible only for finite n. 


Girsanov’s Theorem 


We now generalize the basic transformation in (B.24). For the filtration $\{\mathcal{F}_t\}$
of $(\Omega, \mathcal{F}, P)$, let $P_t$ denote the restriction of $P$ to $\mathcal{F}_t$. Let $\{X(t), t \in [0, T]\}$
be a nonnegative martingale with respect to $\{\mathcal{F}_t\}$ and suppose $E[X(T)] = 1$.
Define a probability measure $Q_t$ on $\mathcal{F}_t$ by setting

$$Q_t(A) = E_P[\mathbf{1}_A X(t)] = E_{P_t}[\mathbf{1}_A X(t)], \quad A \in \mathcal{F}_t;$$

i.e., for each $t \in [0, T]$,

$$\frac{dQ_t}{dP_t} = X(t).$$

Then $Q_t$ is the restriction of $Q_T$ to $\mathcal{F}_t$ because for any $A \in \mathcal{F}_t$,

$$Q_T(A) = E_P[\mathbf{1}_A X(T)] = E_P[\mathbf{1}_A E_P[X(T) \mid \mathcal{F}_t]] = E_P[\mathbf{1}_A X(t)] = Q_t(A).$$

In this sense, the measures $\{Q_t, t \in [0, T]\}$ are consistent. If $X$ is strictly
positive, then $Q_t$ and $P_t$ are equivalent for all $t$. To replace $[0, T]$ with $[0, \infty)$
and still have all $Q_t$ consistent, we would need to add the requirement that
the martingale $X$ be uniformly integrable (see (A.3)).
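
Suppose, now, that $P$ and $Q$ are equivalent probability measures on $(\Omega, \mathcal{F})$
with Radon-Nikodym derivative $dQ/dP$. Define

$$\left(\frac{dQ}{dP}\right)_t = E_P\left[\frac{dQ}{dP}\,\Big|\,\mathcal{F}_t\right], \quad t \in [0, T].$$

We claim that $(dQ/dP)_t$ is a martingale and equals the Radon-Nikodym derivative
of the restriction of $Q$ to $\mathcal{F}_t$ with respect to the restriction of $P$ to $\mathcal{F}_t$.
The martingale property is immediate from the definition. For the second
claim, observe that for any $A \in \mathcal{F}_t$,

$$E_P\left[\mathbf{1}_A\left(\frac{dQ}{dP}\right)_t\right] = E_P\left[E_P\left[\mathbf{1}_A\frac{dQ}{dP}\,\Big|\,\mathcal{F}_t\right]\right] = E_P\left[\mathbf{1}_A\frac{dQ}{dP}\right] = Q(A).$$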


We may summarize this discussion by saying that a nonnegative, unit-mean
martingale defines a consistent (with respect to $\{\mathcal{F}_t, t \in [0, T]\}$) family
of probability measures, and, conversely, the Radon-Nikodym derivatives for
such a family define a unit-mean, nonnegative martingale.

The following simple rule for applying a change of measure to conditional 
expectations arises frequently in mathematical finance, most notably in ap- 
plying a change of numeraire: 


Proposition B.4.1 If $P$ and $Q$ are equivalent and $E_Q[|X|] < \infty$, then

$$E_Q[X \mid \mathcal{F}_t] = \left(\left(\frac{dQ}{dP}\right)_t\right)^{-1} E_P\left[X\,\frac{dQ}{dP}\,\Big|\,\mathcal{F}_t\right].$$


This follows immediately from the definition of $(dQ/dP)_t$; for an explicit
proof, see Musiela and Rutkowski [275], p.458.

We can now state Girsanov’s Theorem. In the following, let $\{W(t), t \in [0, T]\}$
denote a standard $k$-dimensional Brownian motion on $(\Omega, \mathcal{F}, P)$ and
let $\{\mathcal{F}_t^W, t \in [0, T]\}$ denote the filtration generated by $W$ augmented to include
all subsets of sets having $P$-probability 0.


Theorem B.4.1 (Girsanov Theorem). (i) Let $\gamma$ be an $\mathbb{R}^k$-valued process
adapted to $\{\mathcal{F}_t^W\}$ satisfying (B.19) for $t \in [0, T]$ and let

$$X(t) = \exp\left(-\frac{1}{2}\int_0^t \|\gamma(u)\|^2\,du + \int_0^t \gamma(u)^\top\,dW(u)\right). \tag{B.27}$$

If $E_P[X(T)] = 1$, then $\{X(t), t \in [0, T]\}$ is a martingale and the measure $Q$
on $(\Omega, \mathcal{F}_T^W)$ defined by

$$\frac{dQ}{dP} = X(T)$$

is equivalent to $P$. Under $Q$, the process

$$W^Q(t) \triangleq W(t) - \int_0^t \gamma(u)\,du, \quad t \in [0, T],$$

is a standard Brownian motion with respect to $\{\mathcal{F}_t^W\}$. (ii) Conversely, if $Q$
is a probability measure on $(\Omega, \mathcal{F}_T^W)$ equivalent to the restriction of $P$ to $\mathcal{F}_T^W$,
then $(dQ/dP)_t$ admits the representation in (B.27) for some $\gamma$ and $W^Q$ is a
standard Brownian motion under $Q$.


Thus, the change of measure associated with a change of “drift” in a 
Brownian motion over a finite horizon is an absolutely continuous change of 
measure, and every absolutely continuous change of measure for Brownian 
motion is of this type. 
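
A small simulation can illustrate part (i) of the theorem in the scalar case with a deterministic integrand. Under $Q$, $W(T)$ equals a standard Brownian motion plus the accumulated drift, so $E_Q[W(T)] = E_P[X(T)\,W(T)]$ should equal the integral of the drift function over $[0, T]$. The sketch below discretizes time, evaluates the drift function at left endpoints, and compares the weighted average with the corresponding Riemann sum. The function name `girsanov_check`, the choice of drift $\gamma(t) = t$ in the usage line, and the step and path counts are illustrative assumptions.

```python
import math
import random


def girsanov_check(gamma, T=1.0, n_steps=25, n_paths=20_000, seed=5):
    """Return (Monte Carlo estimate of E_P[X(T) W(T)], left-endpoint sum of gamma(t_i) * dt)."""
    rng = random.Random(seed)
    dt = T / n_steps
    sqrt_dt = math.sqrt(dt)
    times = [i * dt for i in range(n_steps)]
    drift_integral = sum(gamma(t) for t in times) * dt
    total = 0.0
    for _ in range(n_paths):
        w = 0.0
        log_x = 0.0  # discretized exponent of the likelihood ratio martingale
        for t in times:
            dw = sqrt_dt * rng.gauss(0.0, 1.0)  # Brownian increment under P
            g = gamma(t)
            log_x += -0.5 * g * g * dt + g * dw
            w += dw
        total += math.exp(log_x) * w
    return total / n_paths, drift_integral


if __name__ == "__main__":
    estimate, target = girsanov_check(lambda t: t)
    print(estimate, target)
```

The likelihood ratio changes the mean of $W(T)$ while the simulated increments themselves are untouched, which is the sense in which the change of measure adds a drift without altering volatility.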

There are several different results, applicable at different levels of gen- 
erality, known as Girsanov’s Theorem. The formulation given here is more 
precisely Girsanov’s Theorem for Brownian motion and is closest to the one 
proved in Section 5.2.2 of Hunt and Kennedy [191]. See Revuz and Yor [306] 
for historical remarks as well as a more general formulation. Usually, only part 
(i) of this result is called Girsanov’s Theorem. Part (ii) is a consequence of 
the Martingale Representation Theorem: as a positive unit-mean martingale, 
$(dQ/dP)_t$ must be of the form (B.27) and then $Q$ and $P$ must be related
as in part (i). We have included the converse because of its importance in 
the theory of derivative pricing. It assures us that when we change from the 
objective probability measure to an equivalent martingale measure, the drifts 
in asset prices may change but their volatilities may not. 

The requirement that $E_P[X(T)] = 1$ in part (i) of Theorem B.4.1 is needed
to ensure that $X$ is a martingale and not merely a local martingale; see Theorem
B.3.1. The Novikov condition is a widely cited sufficient condition for this
requirement (see, e.g., Section 3.5 of Karatzas and Shreve [207] for a proof):


Proposition B.4.2 (Novikov condition). If

$$E\left[\exp\left(\frac{1}{2}\int_0^T \|\gamma(u)\|^2\,du\right)\right] < \infty,$$

then $X$ in (B.27) is a martingale on $[0, T]$.


C
Appendix: The Term Structure of Interest Rates


The term structure of interest rates refers to the dependence of interest rates 
on maturity. There are several equivalent ways of recording this relationship. 
This appendix reviews terminology used for this purpose and describes some 
of the most important interest rate derivative securities. 


C.1 Term Structure Terminology 


A unit of account (e.g., a dollar) invested at a continuously compounded rate
$R$ grows to a value of $e^{RT}$ over the interval from time 0 to $T$. If instead the
investment earns simple interest over $[0, T]$, the value grows to $1 + RT$ over
this interval. The discount factors associated with continuous compounding
and simple interest are thus $e^{-RT}$ and $1/(1 + RT)$, respectively.

Many fixed income securities (including US Treasury bonds) follow an 
intermediate convention in determining how interest accrues: an interest rate 
is quoted on an annual basis with semi-annual compounding. In this case, an 
initial investment of 1 grows to a value of 1+(R/2) at the end of half a year, to 
a value of $(1 + (R/2))^2$ at the end of one year, and so on. A bit more generally,
if we let $\delta$ denote the fraction of a year over which interest is compounded
(with $\delta = 1/2$ and $\delta = 1/4$ the most important cases), the interest accrued
over $n\delta$ years is $(1 + \delta R)^n - 1$. Depending on what day-count convention is
used, the exact lengths of nominally equal six-month or three-month intervals
may vary; if we therefore generalize to allow unequal fractions $\delta_1, \delta_2, \ldots$, then
at the end of $n$ periods, an initial investment of 1 grows to a value of

$$\prod_{i=1}^n (1 + \delta_i R).$$


The associated discount factor is the reciprocal of this expression. 
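
The three compounding conventions described above can be compared directly. The following sketch computes the discount factor under continuous compounding, simple interest, and periodic compounding with possibly unequal accrual fractions; the function names are illustrative assumptions.

```python
import math


def discount_continuous(R, T):
    """Discount factor exp(-R T) under continuous compounding."""
    return math.exp(-R * T)


def discount_simple(R, T):
    """Discount factor 1 / (1 + R T) under simple interest."""
    return 1.0 / (1.0 + R * T)


def discount_periodic(R, deltas):
    """Reciprocal of prod_i (1 + delta_i R) for accrual fractions delta_1, delta_2, ..."""
    growth = 1.0
    for delta in deltas:
        growth *= 1.0 + delta * R
    return 1.0 / growth


if __name__ == "__main__":
    R = 0.06
    print(discount_continuous(R, 1.0), discount_periodic(R, [0.5, 0.5]), discount_simple(R, 1.0))
```

For a positive rate and a one-year horizon, more frequent compounding produces a smaller discount factor, so the semi-annual value lies between the continuous and simple ones.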


Bonds, Yields, and Forward Rates


Let $B(t, T)$ denote the price at time $t$ of a security making a single payment
of 1 at time $T$, $T \ge t$. This is a zero-coupon (or pure-discount) bond with
maturity $T$. A coupon bond makes payments at multiple dates and may be
viewed as a portfolio of zero-coupon bonds. A coupon bond paying $c_i$ at $T_i$,
$i = 1, \ldots, n$, and a principal payment of 1 at $T_n$ has a value prior to maturity
of

$$B_c(t) = \sum_{i=\ell(t)}^n c_i B(t, T_i) + B(t, T_n), \tag{C.1}$$

with $\ell(t)$ the index of the next coupon date, defined by

$$T_{\ell(t)-1} \le t < T_{\ell(t)}.$$


In a world with a constant continuously compounded interest rate, an 
investor could replicate a zero-coupon bond with maturity T by investing 
$e^{-RT}$ in an interest bearing account at time 0 and letting it grow to a value
of 1 at time $T$. It follows that $B(0, T) = e^{-RT}$ in this setting.

More generally, if the continuously compounded rate at time $t$ (the short
rate) is given by a stochastic process $r(t)$, an investment of 1 at time 0 grows
to a value of

$$\beta(t) = \exp\left(\int_0^t r(u)\,du\right)$$

at time $t$. As explained in Chapter 1, the price of a bond is given by

$$B(0, T) = E\left[\exp\left(-\int_0^T r(u)\,du\right)\right], \tag{C.2}$$


the expectation taken under the risk-neutral measure. This is the only identity 
in this section that involves probability. 

The yield of a bond may be interpreted as the interest rate implied by the 
price of the bond; the yield therefore depends on the compounding convention 
assumed for the implied interest rate. The continuously compounded yield 
$Y(t, T)$ for a zero-coupon bond maturing at $T$ is defined by

$$B(t, T) = e^{-Y(t,T)(T-t)} \quad \text{or} \quad Y(t, T) = -\frac{1}{T-t}\log B(t, T). \tag{C.3}$$

The continuously compounded yield $Y_c(t)$ for the bond in (C.1) is defined by
the condition

$$B_c(t) = \sum_{i=\ell(t)}^n c_i e^{-Y_c(t)(T_i - t)} + e^{-Y_c(t)(T_n - t)}.$$

Yields are more commonly quoted on a semi-annual basis. The yield $Y_\delta(t, T)$
associated with a compounding interval $\delta$ solves the equation

$$B(t, T) = \frac{1}{(1 + \delta Y_\delta(t, T))^n},$$

when $n = (T - t)/\delta$ is an integer. This extends to arbitrary $t \in (0, T)$ through
the convention that interest accrues linearly between compounding dates. The 
yield of a coupon bond is similarly defined by discounting the coupons as well 
as the principal payment. 

A forward rate is an interest rate set today for borrowing or lending at 
some date in the future. Consider, first, the case of a forward rate based on 
simple interest, and let $F(t, T_1, T_2)$ denote the forward rate fixed at time $t$ for
the interval $[T_1, T_2]$, with $t \le T_1 < T_2$. An investor entering into an agreement
at time $t$ to borrow 1 at time $T_1$ and repay the loan at time $T_2$ pays interest
at rate $F(t, T_1, T_2)$. More explicitly, the investor receives 1 at $T_1$ and pays
$1 + F(t, T_1, T_2)(T_2 - T_1)$ at $T_2$; no other payments are exchanged.

An arbitrage argument shows that forward rates are determined by bond 
prices. At time $t$, an investor could buy a zero-coupon bond maturing at $T_1$,
funding the purchase by issuing bonds maturing at $T_2$. If the number of bonds
$k$ is chosen to satisfy

$$kB(t, T_2) = B(t, T_1),$$

there is no net cashflow at time $t$. The investor will receive 1 at $T_1$ and pay $k$
at $T_2$. To preclude arbitrage, the amount paid at $T_2$ in this transaction must
be the same as the amount paid in the forward rate transaction, so

$$k = 1 + F(t, T_1, T_2)(T_2 - T_1).$$

But $k = B(t, T_1)/B(t, T_2)$, so we conclude that

$$F(t, T_1, T_2) = \frac{1}{T_2 - T_1}\left(\frac{B(t, T_1) - B(t, T_2)}{B(t, T_2)}\right). \tag{C.4}$$


For much of the financial industry, the most important benchmark interest 
rates are the London Inter-Bank Offered Rates or LIBOR. LIBOR is calcu- 
lated daily through an average of rates offered by select banks in London. 
Separate rates are quoted for different maturities (e.g., three months and six 
months) and different currencies. LIBOR is quoted as a simple annualized 
interest rate. A forward LIBOR rate is a special case of (C.4) with a fixed
length $\delta = T_2 - T_1$ for the accrual period, typically with $\delta = 1/2$ or $\delta = 1/4$.
Thus, the $\delta$-year forward LIBOR rate at time $t$ with maturity $T$ is

$$L(t, T) = F(t, T, T + \delta) = \frac{1}{\delta}\left(\frac{B(t, T) - B(t, T + \delta)}{B(t, T + \delta)}\right). \tag{C.5}$$
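
The relationship between bond prices and simple forward rates translates into a few lines of code. The sketch below computes the simple forward rate $(B(t, T_1) - B(t, T_2)) / ((T_2 - T_1) B(t, T_2))$ and the forward LIBOR rate as the special case with accrual period $\delta$. The function names and the flat 5% continuously compounded curve in the usage example are illustrative assumptions, and credit risk is ignored.

```python
import math


def simple_forward(B1, B2, T1, T2):
    """Simple forward rate for [T1, T2] from bond prices B1 = B(t, T1) and B2 = B(t, T2)."""
    return (B1 - B2) / (B2 * (T2 - T1))


def forward_libor(bond_price, T, delta):
    """Forward LIBOR L(t, T) = F(t, T, T + delta), given a callable T -> B(t, T)."""
    return simple_forward(bond_price(T), bond_price(T + delta), T, T + delta)


if __name__ == "__main__":
    # Hypothetical flat curve: B(t, T) = exp(-0.05 T), viewed from t = 0.
    flat_curve = lambda T: math.exp(-0.05 * T)
    print(forward_libor(flat_curve, 1.0, 0.5))
```

On a flat continuously compounded curve at rate $r$, the forward LIBOR rate is $(e^{r\delta} - 1)/\delta$ for every maturity, slightly above $r$ because simple and continuous compounding quote the same interest differently.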


In taking forward LIBOR rates as a special case of (C.4), we are ignoring 
credit risk. The discussion leading to (C.4) assumes that bonds always make 
their scheduled payments, that issuers never default. But the banks whose 
rates set LIBOR may indeed default and this risk is presumably reflected in
the rates they offer. Equation (C.5) ignores this feature. It is a convenient
simplification often used in practice.

In deriving (C.4) we assumed a simple forward rate, but we could just 
as well have used a continuously compounded forward rate $f(t, T_1, T_2)$. The
interest paid must be the same regardless of the compounding convention, so
we must have

$$\exp\left(f(t, T_1, T_2)(T_2 - T_1)\right) = 1 + F(t, T_1, T_2)(T_2 - T_1).$$

With (C.4), this implies

$$f(t, T_1, T_2) = \frac{\log B(t, T_1) - \log B(t, T_2)}{T_2 - T_1}$$

for the continuously compounded forward rate for the accrual interval $[T_1, T_2]$.

Now define f(t, T) to be the continuously compounded forward rate fixed 
at $t$ for the instant $T$. This is the limit (assuming it exists) of $f(t, T, T + h)$
as $h$ approaches 0, and is thus given by

$$f(t, T) = -\frac{\partial}{\partial T}\log B(t, T).$$

Inverting this relationship and using $B(T, T) = 1$, we get

$$B(t, T) = \exp\left(-\int_t^T f(t, u)\,du\right).$$

Thus, the forward curve $f(t, \cdot)$ is characterized by the property that discounting
along this curve reproduces time-$t$ bond prices.
Comparison with (C.3) reveals that

$$Y(t, T) = \frac{1}{T - t}\int_t^T f(t, u)\,du;$$


i.e., that yields are averages over forward rates. This suggests that forward 
rates are more fundamental quantities than yields and thus potentially a more 
attractive starting point in building models of term structure dynamics. 
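
The statement that yields are averages of forward rates can be verified numerically for any smooth discount curve. The sketch below assumes a hypothetical time-0 curve $B(0, T) = \exp(-(aT + cT^2))$, for which the instantaneous forward rate is $a + 2cT$ and the yield is $a + cT$. The curve, its parameters, and all function names are illustrative assumptions; the forward rate is obtained by a central difference and the average by the midpoint rule.

```python
import math


def bond_price(T, a=0.03, c=0.005):
    """Hypothetical time-0 discount curve B(0, T) = exp(-(a T + c T^2))."""
    return math.exp(-(a * T + c * T * T))


def inst_forward(T, h=1e-5):
    """Instantaneous forward rate f(0, T) = -d/dT log B(0, T), by central difference."""
    return -(math.log(bond_price(T + h)) - math.log(bond_price(T - h))) / (2.0 * h)


def yield_from_price(T):
    """Continuously compounded yield Y(0, T) = -log B(0, T) / T."""
    return -math.log(bond_price(T)) / T


def average_forward(T, n=1_000):
    """(1 / T) * int_0^T f(0, u) du, evaluated with the midpoint rule."""
    h = T / n
    return sum(inst_forward((i + 0.5) * h) for i in range(n)) * h / T


if __name__ == "__main__":
    print(yield_from_price(5.0), average_forward(5.0))
```

Because the yield averages the forward curve, an upward-sloping forward curve lies above the yield curve at every maturity, as the example curve shows ($a + 2cT$ versus $a + cT$).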


Swaps and Swap Rates 


In a standard interest rate swap, two parties agree to exchange payments tied 
to a notional principal, one party paying interest at a floating rate, the other 
at a fixed rate. The principal is notional in the sense that it is never paid by 
either party; it is merely used to determine the magnitudes of the payments. 

Fix a period $\delta$ (e.g., half a year) and a set of dates $T_n = n\delta$, $n =
0, 1, \ldots, M + 1$. Consider a swap with payment dates $T_1, \ldots, T_{M+1}$ on a notional
principal of 100. At each $T_n$, the fixed-rate payer pays $100R\delta$: this is the
simple interest accrued on a principal of 100 over an interval of length $\delta$ at an
annual rate of $R$. Denote by $L_{n-1}(T_{n-1})$ the simple annualized interest rate
fixed at $T_{n-1}$ for the interval $[T_{n-1}, T_n]$. (Thus, $L_{n-1}(T_{n-1}) = L(T_{n-1}, T_{n-1})$
in (C.5), the $\delta$-year LIBOR rate fixed at $T_{n-1}$.) The floating-rate payer pays
$100L_{n-1}(T_{n-1})\delta$ at each $T_n$. The exchange of payments terminates at $T_{M+1}$.

Consider the value of the swap from the perspective of the party paying 
floating to receive fixed; the value to the other party has the same magnitude 
but opposite sign. Although no principal is ever exchanged in the swap, valuation
is simplified if we pretend that each party pays the other 100 at $T_{M+1}$.
These two fictitious payments cancel each other and thus have no effect on 
the value of the swap. With this modification, each party’s payments look like 
those of a bond with a face value of 100, one bond having a fixed coupon 
rate of R, the other having a floating coupon. The value of the swap is the 
difference between the values of the two bonds. 

At time $T_0 = 0$, the value of the fixed rate bond is (see (C.1))

$$100R\delta \sum_{i=1}^{M+1} B(0, T_i) + 100B(0, T_{M+1}).$$


To value the floating rate bond, we argue that it can be replicated with an 
initial investment of 100. Over the interval $[0, T_1]$, the initial investment earns
$100\delta L_0(0)$ in interest, precisely enough to pay the first coupon of the floating
rate bond. The remaining 100 can then be invested at rate $L_1(T_1)$ until $T_2$
to fund the next coupon while preserving the original 100. This reinvestment
process can be repeated until $T_{M+1}$, when the 100 is used to pay the bond’s
principal. Because the cashflows of the floating rate bond can be replicated 
with an initial investment of 100, we conclude that the value of the floating 
rate bond must itself be 100. From the perspective of the floating-for-fixed 
party, the value of the swap is the difference 


$$100R\delta \sum_{i=1}^{M+1} B(0,T_i) + 100B(0,T_{M+1}) - 100 \tag{C.6}$$

between the fixed and floating rate bonds.
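
The replication argument for the floating rate bond can be sketched in a few lines. The function name `replicate_floating_bond` and the randomly drawn fixings are assumptions for illustration; the point is only that the strategy pays every floating coupon and still holds the notional at the end, whatever rates are realized.

```python
import random

def replicate_floating_bond(libor_fixings, delta, notional=100.0):
    # Roll the notional over each period at the LIBOR rate fixed at its start.
    # libor_fixings[n-1] plays the role of L_{n-1}(T_{n-1}), n = 1, ..., M+1.
    cash = notional
    coupons = []
    for L in libor_fixings:
        cash *= 1.0 + delta * L          # investment grows over [T_{n-1}, T_n]
        coupon = notional * delta * L    # floating coupon due at T_n
        cash -= coupon                   # pay it out of the interest earned
        coupons.append(coupon)
    return coupons, cash

rng = random.Random(0)
fixings = [rng.uniform(0.01, 0.08) for _ in range(10)]  # arbitrary rate path
coupons, final_cash = replicate_floating_bond(fixings, delta=0.5)
print(final_cash)
```

The remaining cash equals the notional (up to floating-point rounding) for any path of fixings, which is why the bond is worth 100 at time 0.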

By definition, the swap rate at time 0 (for payments at $T_1, \ldots, T_{M+1}$) is
the fixed rate R that makes the value of the swap (C.6) equal to zero. Both 
parties would willingly enter into a swap at this rate without either party 
having to make an additional payment to the other—the swap is costless. 


From (C.6), we find that the swap rate is

$$S_0(0) = \frac{1 - B(0,T_{M+1})}{\delta \sum_{i=1}^{M+1} B(0,T_i)}.$$
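
A minimal sketch of this calculation follows. The flat 4% discount curve and the function names are hypothetical; the code simply evaluates (C.6) and the swap rate formula and confirms that the swap is costless at that rate.

```python
import math

def swap_value(R, delta, discount_factors, notional=100.0):
    # Value (C.6) to the party receiving fixed and paying floating.
    # discount_factors[i-1] holds B(0, T_i) for i = 1, ..., M+1.
    annuity = delta * sum(discount_factors)
    return notional * R * annuity + notional * discount_factors[-1] - notional

def swap_rate(delta, discount_factors):
    # S_0(0) = (1 - B(0, T_{M+1})) / (delta * sum_i B(0, T_i))
    return (1.0 - discount_factors[-1]) / (delta * sum(discount_factors))

delta = 0.5
# Hypothetical curve: flat 4% continuously compounded yield, ten payment dates.
discount_factors = [math.exp(-0.04 * delta * i) for i in range(1, 11)]
S0 = swap_rate(delta, discount_factors)
print(S0, swap_value(S0, delta, discount_factors))
```

For a flat curve the swap rate reduces to the simple rate $(e^{y\delta} - 1)/\delta$ equivalent to the continuously compounded yield $y$, which gives a quick check on the output.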


We have subscripted the swap rate by 0 to indicate that this is the rate for a
swap beginning at time 0 with payments at $T_1, \ldots, T_{M+1}$. The same derivation
shows that the rate at $T_n$ for a swap with payments at $T_{n+1}, \ldots, T_{M+1}$ is

$$S_n(T_n) = \frac{1 - B(T_n,T_{M+1})}{\delta \sum_{i=n+1}^{M+1} B(T_n,T_i)}.$$
The forward swap rate at time $t < T_n$ for a swap with payment dates
$T_{n+1}, \ldots, T_{M+1}$ is

$$S_n(t) = \frac{B(t,T_n) - B(t,T_{M+1})}{\delta \sum_{i=n+1}^{M+1} B(t,T_i)}. \tag{C.7}$$

An extension of the argument leading to (C.6) shows that this is the rate
that makes a swap with payment dates $T_{n+1}, \ldots, T_{M+1}$ costless at time $t$.
The forward swap rate for a single period ($M = n$) coincides with the forward
LIBOR rate $L(t, T_n)$; see (C.5).
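
The single-period claim is easy to confirm numerically. This sketch assumes the standard bond-price form of the forward LIBOR rate, $L(t,T) = (B(t,T)/B(t,T+\delta) - 1)/\delta$; the discount curve `B` and the function names are hypothetical.

```python
import math

def forward_swap_rate(B_start, B_pay, delta):
    # S_n(t) = (B(t,T_n) - B(t,T_{M+1})) / (delta * sum_{i=n+1}^{M+1} B(t,T_i))
    # B_start is B(t, T_n); B_pay lists B(t, T_i) for i = n+1, ..., M+1.
    return (B_start - B_pay[-1]) / (delta * sum(B_pay))

def forward_libor(B_start, B_end, delta):
    # L(t, T_n) = (B(t, T_n) / B(t, T_n + delta) - 1) / delta
    return (B_start / B_end - 1.0) / delta

def B(T):
    # Hypothetical time-t discount curve, for illustration only.
    return math.exp(-0.03 * T - 0.001 * T * T)

delta = 0.5
single_period = forward_swap_rate(B(2.0), [B(2.5)], delta)   # M = n
libor = forward_libor(B(2.0), B(2.5), delta)
multi_period = forward_swap_rate(B(2.0), [B(2.0 + delta * i) for i in range(1, 7)], delta)
print(single_period, libor, multi_period)
```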


C.2 Interest Rate Derivatives 


The relationships among forward rates, swap rates, and bonds in Section C.1 
are purely algebraic and independent of any stochastic assumptions about 
interest rates. These relationships follow from static no-arbitrage arguments 
that must hold at each valuation date $t$ irrespective of the dynamics of the
term structure. This is not generally true of the value of interest rate futures 
and options. The relationship between these interest rate derivatives and the 
underlying term structure variables ordinarily depends on how one models 
term structure dynamics. 


Futures 


We describe a slightly simplified version of Eurodollar futures, which are 
among the most actively traded contracts in any market. The futures con- 
tract has a settlement value of 


$$100 \cdot (1 - L(T,T))$$

at the expiration date $T$, where $L(T,T)$ is the $\delta$-year LIBOR rate in (C.5).
Through an argument detailed in Section 8.D of Duffie [98], the time-t fu- 
tures price associated with a futures contract is the risk-neutral conditional 
expectation of the settlement value. Define 


$$\hat{L}_T(t) = E[L(T,T) \mid \mathcal{F}_t] = E\left[\frac{1}{\delta}\left(\frac{1}{B(T,T+\delta)} - 1\right) \,\Big|\, \mathcal{F}_t\right] \tag{C.8}$$

with $\mathcal{F}_t$ the history of market prices up to time $t$, and the expectation taken
under the risk-neutral measure. The Eurodollar futures price at time $t$ (for
settlement at $T$) is then $100(1 - \hat{L}_T(t))$. The futures contract commits the
holder to making or receiving payments as the futures price fluctuates through
a process called resettlement or marking to market. In the idealized case of
continuous resettlement, each increment $d\hat{L}_T$ triggers a payment of $100\,d\hat{L}_T$ by
the holder of the contract, with a negative payment interpreted as a dividend.

The relationship (C.8) between the futures rate $\hat{L}_T$ and bond prices depends
on stochastic elements of a model of the dynamics of the term structure,
whereas the relationship between forward rates and bond prices is essentially
algebraic and independent of the choice of model. The futures rate $\hat{L}_T(t)$ is a
martingale under the risk-neutral measure; the forward rate $L(t, T)$ is a
martingale under the forward measure for date $T$. The two processes coincide at
$t = T$.
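
To see the model dependence concretely, consider a deliberately crude toy model, which is an assumption of this sketch and not a model from the text: the short rate is deterministic up to $T$ and then takes one of two values, held constant over $[T, T+\delta]$, each with a given risk-neutral probability. The futures rate is then the expectation in (C.8), while the forward rate follows from time-0 bond prices.

```python
import math

def toy_rates(r_up, r_down, delta, p=0.5):
    # Toy two-state model (illustration only): after T the short rate is
    # r_up with risk-neutral probability p, else r_down, constant on [T, T+delta].
    states = [(p, r_up), (1.0 - p, r_down)]
    # Futures rate at time 0: E[(1/B(T,T+delta) - 1)/delta], with
    # B(T,T+delta) = exp(-r*delta) in each state.
    futures = sum(q * (math.exp(r * delta) - 1.0) / delta for q, r in states)
    # Forward rate at time 0: (B(0,T)/B(0,T+delta) - 1)/delta, where
    # B(0,T+delta)/B(0,T) = E[exp(-r*delta)] since rates before T are deterministic.
    ratio = sum(q * math.exp(-r * delta) for q, r in states)
    forward = (1.0 / ratio - 1.0) / delta
    return futures, forward

futures, forward = toy_rates(0.06, 0.02, delta=0.25)
print(futures, forward)
```

In this toy setting the futures rate exceeds the forward rate whenever the two states differ (a consequence of Jensen's inequality), and the gap vanishes when the rate is deterministic. The size of the gap depends entirely on the assumed dynamics.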


Caps and Floors 


An interest rate cap is a portfolio of options that serve to limit the interest paid 
on a floating rate liability over a set of consecutive periods. Each individual 
option in the cap applies to a single period and is called a caplet. Because 
the value of a cap is simply the sum of the values of its component caplets, it 
suffices to discuss valuation of caplets. 

Consider, then, a caplet for the interval $[T, T+\delta]$. A party with a floating
rate liability over this interval would pay interest equal to $\delta L(T,T)$ times the
principal at the end of the interval, with $L(T,T)$ the $\delta$-year rate in (C.5). A
security designed to limit the interest rate paid to some fixed level $K$ refunds
the difference $\delta(L(T,T) - K)$ (per unit of principal) if this difference is positive
and pays nothing otherwise. Thus, the payoff of a caplet is

$$\delta(L(T,T) - K)^+, \tag{C.9}$$


and this payment is made at $T + \delta$. This can also be written as

$$\left(\exp\left(\int_T^{T+\delta} f(T,u)\,du\right) - (1 + \delta K)\right)^+$$

using the curve $f(T,\cdot)$ of instantaneous forward rates at time $T$.
A floor similarly sets a lower limit on interest payments. A single-period
floor (a floorlet) for the interval $[T, T+\delta]$ pays $\delta(K - L(T,T))^+$ at $T + \delta$.
The caplet payoff (C.9) is received at time $T + \delta$ but fixed at time $T$; there
is no uncertainty in the payoff over the interval $[T, T+\delta]$. Hence, a security
paying (C.9) at time $T + \delta$ is equivalent to one paying

$$\frac{\delta}{1 + \delta L(T,T)}\,(L(T,T) - K)^+ = \delta B(T,T+\delta)\,(L(T,T) - K)^+ \tag{C.10}$$

at time $T$.
The payoff of a caplet (or floorlet) can be replicated by trading in two
underlying assets, the bonds maturing at $T$ and $T + \delta$. Valuing a caplet entails
determining the initial cost of this trading strategy or, more directly, comput- 
ing the expected present value of the caplet’s payoff. This requires specifying 
a model of the dynamics of the term structure.

By market convention, caplet prices are quoted through Black’s formula,
after Black [49]. This formula equates the time-$t$ price of the caplet to

$$\delta B(t,T+\delta)\left(L(t,T)\,\Phi\!\left(\frac{\log(L(t,T)/K) + \sigma^2(T-t)/2}{\sigma\sqrt{T-t}}\right) - K\,\Phi\!\left(\frac{\log(L(t,T)/K) - \sigma^2(T-t)/2}{\sigma\sqrt{T-t}}\right)\right),$$

with $\Phi$ the standard cumulative normal distribution. This expression is what
one would obtain for the expectation of $\delta B(t,T+\delta)(L(T,T) - K)^+$ if $L(\cdot,T)$
satisfied

$$\frac{dL(t,T)}{L(t,T)} = \sigma\,dW(t),$$

though this does not necessarily correspond to a price in the sense of the
theory of derivatives valuation. In practice, the Black formula is typically used
in reverse, to extract the “implied volatility” parameter $\sigma$ from the market
prices of caps. An obvious modification of the formula above produces the
Black formula for a floor.
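
The following sketch evaluates Black's caplet formula and then uses it "in reverse" to back out an implied volatility. The function names, the bisection bounds, and the sample inputs are assumptions made for illustration; the bisection relies on the Black price being increasing in $\sigma$.

```python
import math

def norm_cdf(x):
    # Standard normal cumulative distribution function via the error function.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def black_caplet(L, K, sigma, tau, delta, discount):
    # L: forward rate L(t,T); tau = T - t; discount = B(t, T + delta).
    s = sigma * math.sqrt(tau)
    d1 = (math.log(L / K) + 0.5 * s * s) / s
    d2 = d1 - s
    return delta * discount * (L * norm_cdf(d1) - K * norm_cdf(d2))

def implied_vol(price, L, K, tau, delta, discount, lo=1e-6, hi=5.0, tol=1e-10):
    # Bisection on sigma over the (assumed) bracket [lo, hi].
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if black_caplet(L, K, mid, tau, delta, discount) < price:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

# Hypothetical inputs: 5% forward rate, at-the-money strike, 20% volatility.
price = black_caplet(L=0.05, K=0.05, sigma=0.20, tau=1.0, delta=0.5, discount=0.94)
recovered = implied_vol(price, L=0.05, K=0.05, tau=1.0, delta=0.5, discount=0.94)
print(price, recovered)
```

Recovering the input volatility from the computed price is a round-trip check on both functions; with market data, `price` would instead be an observed quote.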


Swaptions 


A swaption is an option to enter into a swap. A “2 x 5” or “2-into-5” swaption 
is a two-year option to enter into a five-year swap. A bit more generically, 
consider an option expiring at $T_n$ to enter into a swap with payment dates
$T_{n+1}, \ldots, T_{M+1}$. Suppose the option grants the holder the right to pay fixed
and receive floating on a notional principal of 1. Denote by $R$ the fixed rate
specified in the underlying swap. At the expiration date $T_n$, the value of the
underlying swap is then

$$V(T_n) = 1 - R\delta \sum_{i=n+1}^{M+1} B(T_n,T_i) - B(T_n,T_{M+1}),$$

by the argument used to derive (C.6). The holder of the option exercises if
the swap has positive value and otherwise lets the option expire worthless.
We may therefore think of the swaption as an instrument that pays $V(T_n)^+$
at $T_n$.

Using (C.7), we find that

$$V(T_n)^+ = \delta \sum_{i=n+1}^{M+1} B(T_n,T_i)\,(S_n(T_n) - R)^+. \tag{C.11}$$


Hence, the swaption looks like a call option on a swap rate. This formulation 
is convenient because modeling the dynamics of forward swap rates is more 
natural than modeling the dynamics of swap values, just as modeling the
dynamics of forward interest rates is more natural than modeling bond prices.
Furthermore, this representation makes it evident that a caplet is a special 
case of a swaption by taking M = n and comparing with (C.10). 
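
The equivalence between the swap-value payoff and the swap-rate form (C.11) can be checked numerically. The bond prices below are hypothetical values of $B(T_n, T_i)$, and the function names are illustrative.

```python
import math

def swap_value_at_expiry(R, delta, bonds):
    # V(T_n) = 1 - R*delta*sum_i B(T_n,T_i) - B(T_n,T_{M+1}), i = n+1, ..., M+1
    return 1.0 - R * delta * sum(bonds) - bonds[-1]

def swap_rate_at_expiry(delta, bonds):
    # S_n(T_n) = (1 - B(T_n,T_{M+1})) / (delta * sum_i B(T_n,T_i))
    return (1.0 - bonds[-1]) / (delta * sum(bonds))

def swaption_payoff(R, delta, bonds):
    # (C.11): delta * sum_i B(T_n,T_i) * (S_n(T_n) - R)^+
    return delta * sum(bonds) * max(swap_rate_at_expiry(delta, bonds) - R, 0.0)

delta = 0.5
bonds = [math.exp(-0.04 * delta * i) for i in range(1, 11)]  # hypothetical curve
for R in (0.02, 0.04, 0.06):
    print(R, max(swap_value_at_expiry(R, delta, bonds), 0.0),
          swaption_payoff(R, delta, bonds))
```

Passing a single bond price (the case $M = n$) reduces `swaption_payoff` to $\delta B(T_n, T_{n+1})(L - R)^+$ with $L = (1/B(T_n,T_{n+1}) - 1)/\delta$, which has the form of the caplet payoff (C.10).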

A swaption can be replicated by trading in bonds maturing at $T_n, T_{n+1},
\ldots, T_{M+1}$. Valuing a swaption entails determining the initial cost of this
trading strategy or, more directly, computing the expected present value of 
the swaption payoff. This requires specifying a model of the dynamics of the 
term structure. 

By market convention, swaption prices are quoted through a version of 
Black’s formula. This formula equates the time-t price of the swaption to 


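$$\delta \sum_{i=n+1}^{M+1} B(t,T_i)\left(S_n(t)\,\Phi\!\left(\frac{\log(S_n(t)/K) + \sigma^2(T_n-t)/2}{\sigma\sqrt{T_n-t}}\right) - K\,\Phi\!\left(\frac{\log(S_n(t)/K) - \sigma^2(T_n-t)/2}{\sigma\sqrt{T_n-t}}\right)\right).$$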


This expression is what one would obtain for the expectation of 


$$\delta \sum_{i=n+1}^{M+1} B(t,T_i)\,(S_n(T_n) - K)^+$$

if the forward swap rate satisfied

$$\frac{dS_n(t)}{S_n(t)} = \sigma\,dW(t).$$


As with the Black formula for caplets, this does not necessarily correspond to 
a price in the sense of the theory of derivatives valuation (but see Jamshidian 
[197] for a setting in which it does). In practice, the Black formula is used to 
extract an implied volatility for swap rates. 
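
The statement that Black's swaption formula is the expectation of the payoff under driftless lognormal dynamics can be illustrated with a small simulation. This is a sketch under stated assumptions: the inputs are hypothetical, the bond prices $B(t,T_i)$ are treated as fixed weights exactly as in the expression above, and the terminal swap rate is sampled directly from the lognormal distribution implied by $dS_n/S_n = \sigma\,dW$.

```python
import math
import random

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def black_swaption(S, K, sigma, tau, delta, bonds):
    # S: forward swap rate S_n(t); tau = T_n - t;
    # bonds: B(t, T_i) for i = n+1, ..., M+1.
    s = sigma * math.sqrt(tau)
    d1 = (math.log(S / K) + 0.5 * s * s) / s
    d2 = d1 - s
    return delta * sum(bonds) * (S * norm_cdf(d1) - K * norm_cdf(d2))

def mc_swaption(S, K, sigma, tau, delta, bonds, n_paths=200_000, seed=1):
    # Monte Carlo estimate of delta * sum_i B(t,T_i) * E[(S_n(T_n) - K)^+].
    rng = random.Random(seed)
    s = sigma * math.sqrt(tau)
    total = 0.0
    for _ in range(n_paths):
        z = rng.gauss(0.0, 1.0)
        S_T = S * math.exp(-0.5 * s * s + s * z)  # driftless lognormal sample
        total += max(S_T - K, 0.0)
    return delta * sum(bonds) * total / n_paths

delta, tau = 0.5, 2.0
bonds = [math.exp(-0.04 * (tau + delta * i)) for i in range(1, 11)]  # hypothetical
black = black_swaption(0.05, 0.05, 0.20, tau, delta, bonds)
mc = mc_swaption(0.05, 0.05, 0.20, tau, delta, bonds)
print(black, mc)
```

The two numbers should agree to within Monte Carlo error, which shrinks at the usual $1/\sqrt{n}$ rate as `n_paths` grows.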
in quantile estimation, 490

in random tree method, 439 
variance decomposition, 207 
with acceptance-rejection, 63 
with the normal distribution, 205 


arbitrage 


and discretization error, 159, 374 

and matching underlying assets, 245 

definition, 26 

from parallel shift, 153 

restrictions and control variates, 188, 
191 

restrictions in HJM framework, 151 

restrictions in LIBOR market model, 
169 


arcsine law, 56, 59 
Asian option, 8 


continuously monitored, 99 

control variates for, 189 

delta, 389 

discretely monitored, 99 

likelihood ratio method, 404 

on geometric average, 99, 189, 324 
using Latin hypercube sampling, 242 
vega, 390, 409 

with importance sampling, 270 


associated random variables, 207 
asymptotically optimal importance 


sampling, 264, 270, 499-500, 517, 
531 


b-ary expansion, 285 


algorithm, 296
barrier option, 100
and conditional Monte Carlo, 267 
and default, 521 
and Koksma-Hlawka bound, 290 
discretization error, 368—370, 521 
likelihood ratio method, 405 
on multiple assets, 105 
pathwise method, 393, 395, 400 
random computing time, 11 
using Latin hypercube sampling, 242 
using quasi-Monte Carlo, 334 
with importance sampling, 264 
base-b expansion, 285 
batching, 543 
and control variates, 202 
and moment matching, 246 
and stratified sampling, 216 
Beasley-Springer-Moro approximation, 
67 
Bermudan option, 423 
Bessel process, 132, 373 
beta distribution, 59, 229 
bias, 5 
from discretization, 339-376 
in American option pricing, 427—428, 
433, 435—436, 443, 446—448, 450, 
462, 472 
in control variates, 191, 200, 278 
in finite-difference estimators, 
378-379 
in likelihood ratio method, 407 
in moment matching, 245 
in pathwise method, 393 
in ratio estimators, 234 
sources of, 12—16 
bias-variance tradeoff, 16-19, 365-366, 
381-383, 456 
binomial lattice, 230 
Latin hypercube sampling, 238 
simulating paths, 231 
terminal stratification, 231-232 
Black formula, 566, 567 
for cap, 173 
Black-Scholes formula, 5 
as risk-neutral expectation, 31 
as solution to PDE, 25 
for geometric average option, 100 
in jump-diffusion option pricing, 137 
with dividends, 32 


Black-Scholes PDE, 25 
bonds, 560 
as control variates, 190 
as numeraire, 34, 116, 131, 154, 171 
in CIR model, 128 
in Gaussian short rate models, 
111-118 
in HJM framework, 151, 163 
in LIBOR market model, 166 
in LIBOR market model simulation, 
177 
in Vasicek model, 113-114 
second-order discretization, 358 
subject to default, 520 
Box-Muller method, 65 
and radial stratification, 227 
Brownian bridge, 83 
maximum of, 56, 367 
Brownian bridge construction, 82-86 
algorithm, 84 
and quasi-Monte Carlo, 333-334, 337 
and stratified sampling, 221 
continuous limit, 89 
in multiple dimensions, 91 
Brownian motion 
covariance matrix, 82 
definition, 79 
Latin hypercube sampling of, 238 
maximum of, 56, 360, 367 
multivariate, 90 
stratified sampling of, 221 


cap, 274, 416, 565-566 
and calibration, 180 
delta, 399 
discretization error, 371 
in HJM framework, 163 
in LIBOR market model, 172, 180 
with importance sampling, 275 
with stratified sampling, 225 
caplet, 565 
central limit theorem, 9, 541—544 
contrasted with Koksma-Hlawka 
bound, 289 
for finite-difference estimators, 382 
for poststratified estimators, 235 
for quantile estimator, 490 
for ratio estimators, 234 
using delta method, 204 


with antithetic sampling, 206 
with control variates, 196 
with discretization error, 366 
with Latin hypercube sampling, 241 
with nonlinear controls, 204 
with random number of replications, 
12, 541 
with stratified sampling, 216 
central-difference estimator, 378 
change of measure, 255, 553-557 
and conditional expectation, 556 
and risk measurement, 483, 521 
as change of drift, 557 
as change of intensity, 523 
for risk-neutral pricing, 28 
in American option pricing, 450 
in heavy-tailed model, 279, 515-516, 
537 
in HJM framework, 154 
in LIBOR market model, 171 
in Vasicek model, 117 
through change of numeraire, 33 
through delta-gamma approximation, 
495 
through geometric Brownian motion, 
106 
through weighted Monte Carlo, 253 
characteristic function 
delta-gamma approximation, 487 
indirect delta-gamma approximation, 
513 
inversion integral, 488 
stable laws, 148 
Chebyshev’s inequality, 289, 543 
chi-square $\chi^2$, 122, 125, 227, 392, 509
moment generating function, 514 
Cholesky factorization, 72, 486 
for Brownian motion, 82 
CIR model, 120, 128, 524 
delta, 392 
combined random number generators,
50
commutativity condition, 353 
conditional excess, 483, 506, 519 
conditional Monte Carlo, 279, 369, 399 
for barrier options, 267 
conditional sampling 
in stratified sampling, 211, 214, 
502-504
using acceptance-rejection, 61
using inverse transform, 57 
confidence interval, 6, 541-544 
combining high and low estimators, 
431, 434, 437 
for American option, 431, 434, 437 
through batching, 216, 242, 246, 543 
with antithetic sampling, 206 
with control variates, 196, 198 
with Latin hypercube sampling, 241 
with moment matching, 246 
with stratified sampling, 216 
constant diffusion transformation, 373 
constant elasticity of variance (CEV), 
133 
continuation values, 426 
control variates, 185-205, 277, 420, 440 
and moment matching, 245, 246, 249 
and price sensitivities, 418 
and weighted Monte Carlo, 199, 254 
biased, 278 
compared with stratified sampling, 
220 
confidence intervals, 195-196, 198 
delta estimation, 417 
delta-gamma approximation, 493 
for quantile estimation, 493—494 
in stochastic mesh, 457 
loss factor, 201 
optimal coefficient, 186, 197 
score function, 411 
variance decomposition, 198 
convergence 
modes of, 539-541 
convergence in distribution, 540 
convergence order, 344—348 
copula, 511, 527, 529 
credit rating, 524 
credit risk, 520-535 
variance reduction techniques, 
529-535 
cumulant generating function 
after change of measure, 501 
conditional, 532 
definition, 260 
figure, 264, 265, 498 
of delta-gamma approximation, 487 
of indirect delta-gamma approximation, 515
default, 121, 520
and LIBOR, 166 
default intensity, 522-524 
deflator, 26 
delta, 23, 377, 485 
Black-Scholes, 388, 403 
cap, 399 
finite-difference estimators, 383 
likelihood ratio method, 403, 404, 
406, 414 
mixed estimator, 417 
numerical comparison, 412 
path-dependent options, 389, 404 
pathwise, 388, 389, 391 
square-root diffusion, 391 
stochastic volatility, 414 
delta hedging, 23, 377 
and control variates, 194 
delta method, 203, 234 
delta-gamma approximation, 485-488 
and importance sampling, 495-500, 
515-517 
as a control variate, 493—494 
cumulant generating function, 487 
in heavy-tailed model, 512-514 
indirect, 513-515 
with stratified sampling, 500-504, 
517-518 
depth-first processing, 437 
digital nets, 314 
dimension (of integration problem), 3, 
62, 282, 285, 324, 327, 337 
discounting, 4, 559 
by stochastic discount factor, 26 
in LIBOR market model, 168 
in stochastic short rate model, 108 
discrepancy, 283-285 
extreme, 284 
isotropic, 284, 290 
L2, 290 
star, 284 
discretization error, 8, 110, 115, 159, 
339-376 
and change of variables, 371-375 
and control variates, 191 
and likelihood ratio method, 414 
and path adjustment, 250 


and path-dependence, 357—360, 
366-370 

and pathwise method, 396 

as example of bias, 13 

bias-variance tradeoff, 365-366 

in barrier options, 368-370 

in cap deltas, 399 

in caplet pricing, 180 

in discounted bonds, 157 

in forward curve, 155 

in jump-diffusion processes, 363-364 

in LIBOR market model, 174 

in option payoff, 14 

in quantile estimation, 536 

second-order methods, 348-357 
dividends, 31, 96 

and American options, 423, 469 
dual formulation of American option 

pricing, 470-478 

connection with regression, 476—478 

optimal martingale, 472 
dynamic financial analysis, 537 
dynamic programming, 424, 426 


empirical martingale method, 245 
equivalent martingale measure, 28 
error function (Erf), 70 
Euler approximation, 7, 81, 121, 
339—340 
and likelihood ratio method, 414 
and pathwise method, 396 
convergence order, 345-347 
deterministic volatility function, 103 
in LIBOR market model, 175 
in square-root diffusion, 124 
in Vasicek model, 110 
exercise region, 422, 425 
for max option, 464 
parametric approximation, 426 
experimental design, 385 
exponential family, exponential twisting, 
exponential change of measure, 
260, 262, 264, 278, 407 
and default indicators, 530 
and delta-gamma approximation, 
495, 498 
and indirect delta-gamma, 515 
exponential tail, 508 
extreme value theory, 536 


factor loading, 75 
Faure generator matrices, 298 
algorithm, 301 
Faure net, 300 
cyclic property, 300, 331 
Faure sequence, 297-303 
algorithm, 302 
as (t, d)-sequence, 300 
implementation, 300-303 
numerical examples, 325-330 
plateaus, 328 
starting point, 302, 325, 327, 329 
filtration, 545 
finite-difference estimators, 378-386 
bias-variance tradeoff, 381-383 
forward measure 
defined, 34 
in CIR model, 131 
in HJM framework, 154, 159 
in LIBOR market model, 171-172 
in Vasicek model, 117-118 
forward price, 98 
as input to simulation, 102 
forward rate, 560-562 
continuous compounding, 149 
factors, 273 
simple compounding, 165 
forward-difference estimator, 378 
Fundamental Theorem of Asset Pricing, 
27 
futures price, 97, 564 


gamma, 378, 485 
central-difference estimator, 384 
likelihood ratio method, 411-413 
pathwise method, 392 
gamma distribution, 125-127, 143, 508, 
516 
exponential family, 261 
sampling algorithm, 126, 127 
gamma process, 143-144
Gaussian short rate models, 108-120 
multifactor, 118 
generalized Faure sequences, 316, 323 
generalized feedback shift register 
methods, 52 
generalized Niederreiter sequences, 316 
geometric Brownian motion, 93-107 
as numeraire, 106
derivation of SDE, 93
Girsanov theorem, 29, 35, 37, 107, 117, 
131, 155, 171, 409, 523, 555-557 
Gray code, 307, 313 
in base b, 308 
Greeks, 377 


Halton sequence, 293-297 
discrepancy, 294 
implementation, 296-297 
in high dimensions, 295 
leaped, 295 
Hammersley points, 294 
Hardy-Krause variation, 287 
of an indicator function, 290 
Heath-Jarrow-Morton (HJM) frame- 
work, 150-155 
discretized, 374 
simulation algorithm, 160-162 
with importance sampling, 273-276 
with path adjustment, 250 
heavy-tailed distributions, 148, 279, 
506-511 
Heston model, 121 
likelihood ratio method, 414 
pathwise method, 392 
second-order discretization, 356 
high estimator, 432-434, 446—447, 472 
and duality, 478 
HJM drift 
and control variates, 191 
discretized, 158, 160 
forward measure, 155 
risk-neutral measure, 152 
Ho-Lee model, 109, 111 
in HJM framework, 154 
hyperbolic model, 146 


importance sampling, 255-276, 278 
and likelihood ratio method, 407 
and stochastic mesh methods, 450 
asymptotic optimality, 264, 270, 

499-500, 517 
combined with stratified sampling, 
271—273, 500-504, 517-518 
deterministic change of drift, 267 
for credit risk, 529-535 
for knock-in option, 264 
for path-dependent options, 267—271
optimal drift, 269, 274
optimal path, 269-271 
through log-linear approximation, 
268-269 
using delta-gamma approximation, 
495-500, 515-517 
with heavy-tailed distributions, 279, 
515-516, 537 
zero-variance estimator, 256 
independent-path construction, 444 
infinitely divisible distribution, 143-145 
stable distribution, 148 
insurance risk, 262, 279, 537 
intensity, 140, 141, 364, 522, 523 
interleaving estimator, 449, 461 
inverse Gaussian distribution, 144 
sampling algorithm, 146 
inverse transform method, 54-58, 67, 
127, 128, 231, 367, 490, 502, 511, 
524
and antithetics, 205 
and Latin hypercube sampling, 237 
and stratified sampling, 212 
avoiding zero, 294 
for conditional sampling, 57 
for Poisson distribution, 128 
in quasi-Monte Carlo, 331 
inversive congruential generator, 52 
Itô’s formula, 545—547 
in operator notation, 348 
Itô-Taylor expansions, 362-363 


jump-diffusion process, 134, 524 
discretization, 363—364 


Karhunen-Loève expansion, 89

Koksma-Hlawka inequality, 288 
generalizations, 289 

Korobov rules, 317, 329 

kurtosis, 134, 507 


Lévy area, 344 
Lévy construction of Brownian motion, 
90 

Lévy process, 142—149, 364 

Latin hypercube sampling, 236-243, 278 
and normal copula, 527 
confidence intervals for, 241 
in a binomial lattice, 238 


in quantile estimation, 490 
in random tree method, 439 
of Brownian paths, 238 
variance reduction, 240 
lattice rules, 316-320, 325 
extensible, 319 
integration error, 318—319 
Korobov, 317 
rank, 316 
LIBOR, 166, 561 
LIBOR market model, 166-174 
Bermudan swaption, 429 
commutativity condition, 354 
control variate for, 192 
discretization error, 364, 371 
likelihood ratio method, 415—416 
pathwise method, 398-399 
stratified sampling, 224 
transition density, 415, 457 
likelihood ratio, 256-259, 553 
for change of mean and covariance in 
normal family, 497 
for change of mean in normal family, 
260, 533 
for change of numeraire, 33 
for conditional process, 369 
for delta-gamma approximation, 496 
for increased default probability, 530 
for indirect delta-gamma, 515 
geometric Brownian motion, 106 
in Vasicek model, 117 
over long horizon, 259, 409 
over random horizon, 258 
relating objective and risk-neutral 
measures, 28 
skewness, 260 
weights in stochastic mesh, 450—456, 
466, 468 
likelihood ratio method, 401—418 
and regression, 418 
combined with pathwise method, 412, 
416—417 
for barrier option, 405 
for general diffusion processes, 
413-415 
for LIBOR market model, 415—416
for square-root diffusion, 406 
gamma, 411-413 
limitations, 407 


variance, 407—411 
vega, 405 
with stochastic volatility, 414 
linear congruential generator, 40, 43-49, 
317 
local martingale, 550 
lognormal distribution, 94, 508 
moments, 95 
lookback option, 101 
discretization error, 360, 367 
loss factor (for control variates), 201 
low discrepancy, 285 
low estimator, 428, 434-437, 443, 
447-448, 450, 462 


Marsaglia-Bray method, 66 
martingale, 550-553 
and duality, 471 
and moment matching, 245 
and stochastic intensity, 522 
control variate, 188 
deflated bond, 169 
discounted asset price, 26, 29 
discounted bond price, 151 
discrete discounted bond prices, 156 
exponential, 117, 151, 552 
from approximate value function, 473 
from Poisson process, 137 
from stochastic integral, 551 
from stopping rule, 474 
futures price, 98 
geometric Brownian motion, 96 
local, 550 
optimal, 472 
score, 410 
martingale discretization, 158, 160, 
374-375 
in LIBOR market model, 176-180 
martingale representation theorem, 36, 
522, 552
Milstein schemes, 343, 347, 351 
mixed (iterated) Brownian integrals, 
344 
mod operation, 41 
moment generating function 
chi-square distribution, 514 
delta-gamma approximation, 487 
failure with heavy tails, 512 
normal distribution, 65, 95 




portfolio loss, 530 
moment matching, 244-254, 277 
compared with control variates, 246 
multifactor Gaussian models, 118 
multiple recursive generator (MRG), 50 
in C, 51 


$N(\mu, \sigma^2)$, 63
$N(\mu, \Sigma)$, 64
net, 291 


Niederreiter sequences, 314 
Niederreiter-Xing sequences, 316 
noncentral chi-square $\chi'^2_\nu(\lambda)$, 122, 123,
130, 133, 392, 406, 414, 487 
nonlinear control variates, 202—205 
normal distribution, 63—77 
acceptance-rejection method, 60 
conditioning formula, 65 
copula, 527 
for change of variables, 372 
inverse of, 67 
mixture, 507, 509 
multivariate, 64, 71 
numerical approximation, 69 
sampling methods, 60, 65—69 
normal inverse Gaussian process, 
144-147 
numeraire, 19, 32, 255 
and importance sampling, 267 
bond in HJM framework, 154, 160 
bond in LIBOR market model, 171 
for spot measure, 169 
geometric Brownian motion, 106 
in CIR model, 131 
in Gaussian short rate model, 116 


obligor, 520 
optimal allocation of samples, 217—218 
Ornstein-Uhlenbeck process, 108, 132 


parametric value function, 430 
pathwise method, 386—401 
combined with likelihood ratio 
method, 412, 416—417 
conditions for unbiasedness, 393-396 
for general diffusion processes, 
396-397 
for LIBOR market model, 398-399 
limitations, 392, 396
smoothing, 399-401
Poisson distribution, 123, 127—128 
exponential family, 261 
inverse transform method, 128 
Poisson process, 56, 136, 138, 363, 524 
and stratified sampling, 228-229 
inhomogeneous, 140-141 
of insurance claims, 262 
poststratification, 232-235 
and control variates, 494 
and quantile estimation, 494 
asymptotic variance, 235 
compared with stratified sampling, 
235 
of delta-gamma approximation, 504 
predictor-corrector method, 372, 373 
pricing kernel, 26
primitive polynomials 
defined, 304 
tables of, 305 
principal components, 74 
and value-at-risk, 492 
of discretely observed Brownian
motion, 88
of forward rates, 184 
optimality of, 75 
principal components construction, 
86-89 
and quasi-Monte Carlo, 334, 337 
in multiple dimensions, 92 
put-call parity, 245 


quantile (fractile, percentile), 55, 347, 
483 
control variates for, 493—494 
discretization error, 536 
estimation, 279, 490—491 
estimation with importance sampling, 
500, 519 
in stratified sampling, 212 
quanto options, 105 
quasi-Monte Carlo, 281-337 
and Brownian bridge construction, 
333 
and principal components, 334 
and stratified sampling, 333 
integration, 282 
numerical comparisons, 323—330 


radial stratification, 227 
radical inverse, 286, 315 
Radon-Nikodym derivative, 28, 33, 256, 
553 
Radon-Nikodym theorem, 553 
random permutation, 237 
random tree methods, 430—441, 478 
depth-first processing, 437 
pruning, 438 
variance reduction, 439 
random walk construction of Brownian 
motion, 81, 91 
randomized quasi-Monte Carlo, 320-323 
ratio estimator 
as example of bias, 12 
central limit theorem for, 234 
ratio-of-uniforms method, 125 
Rayleigh distribution, 56, 367 
regression 
and control variates, 187, 197, 200 
and likelihood ratio method, 418 
and price sensitivities, 418 
and weighted Monte Carlo, 200, 254 
in American option pricing, 459-470 
residuals and duality, 476 
Richardson extrapolation, 360, 364, 375 
risk management, 481—492, 520-535 
risk-neutral measure, 4 
defined, 28 
risk-neutral pricing, 27 
Romberg extrapolation, 360 
Runge-Kutta methods, 351 


score function, 403—406, 408, 419 
as control variate, 411 
generalized for second derivatives, 
411 
variance increase, 410 
scrambled nets, 322
second-order discretization, 348—357 
multidimensional, 351—357 
of Heston model, 356 
simplified, 355-357 
seed, random number, 41, 50 
self-financing trading strategy, 22 
simple interest, 165, 559 
Sobol’ generator matrices, 304 
algorithm, 313 
Sobol’ sequence, 303-314 


algorithm, 314 
as (t, d)-sequence, 312 
discrepancy, 312 
implementation, 313-314 
initialization, 309-312 
numerical examples, 325-330 
quality of coordinates, 312, 331 
starting point, 325, 327, 329 
Sobol’s Property A, 309 
spectral test, 48, 319 
spot measure, 168-171 
spread option, 105 
square-root diffusion, 120-134 
algorithm, 124 
and CEV process, 133 
as model of default intensity, 523 
degrees of freedom, 122, 133 
delta, 391, 406 
discretization, 121, 356, 373 
existence and uniqueness, 548 
likelihood ratio method, 406 
noncentrality parameter, 122 
pathwise method, 391 
stationary distribution, 122 
squared Gaussian models, 132 
stable laws, 147, 508 
stable processes, 147—149 
star discrepancy, 284 
stochastic differential equation (SDE), 
339, 548-550 
existence and uniqueness, 548 
for general system of assets, 21 
for geometric Brownian motion, 4 
stochastic mesh methods, 443-459, 
465-470, 478 
computational costs, 456, 466, 478 
high estimator, 446-447 
independent-path construction, 444 
likelihood ratio weights, 450-456 
low estimator, 447-448 
with regression weights, 459, 465 
stochastic volatility, 121, 278, 356, 392, 
400 
and control variate, 192 
stopping rules (for American options), 
425, 428, 443, 447-448, 450, 474 
stratified sampling, 209-235, 237, 277 
and American option pricing, 443
and delta-gamma approximation,
500-504, 517-518 

and quasi-Monte Carlo, 333, 335 

compared with control variates, 220 

confidence intervals for, 216 

in stochastic mesh, 454 

of Brownian motion, 221 

optimal allocation, 217-218 

optimal directions, 226, 271 

variance reduction, 218—220 
strong error criterion, 344 
subordinator, 143 
swap, 164, 562 

in LIBOR market model, 173 
swaption, 566-567 

and calibration, 180 

Bermudan, 429 

in HJM framework, 164 

in LIBOR market model, 173 

with stratified sampling, 225 
systematic sampling, 208, 321 


t distribution, 509-511 
copula, 511 
in confidence interval, 6, 543 
multivariate, 510 
(t, d)-sequence, 291 
terminal stratification, 220-222, 
230-232 
thinning, 141, 364 
(t, m, d)-net, 291 
transition density, 415 
and score function, 410 
in stochastic mesh, 450-453, 457 
of geometric Brownian motion, 404, 
455 
of square-root diffusion, 121 


uniform distribution, Unif[0,1], 54 
uniform integrability, 16, 19, 393, 395 
definition, 541 
unit hypercube, 214 
open versus closed, 282 


value-at-risk (VAR), 483-492, 520 
Van der Corput sequences, 285—287 
variance gamma process, 144 
Vasicek model, 108, 113 
in HJM framework, 154 
stationary distribution, 110
vega, 378
Asian option, 409 
Black-Scholes, 389, 404 
likelihood ratio method, 404, 405 
path-dependent options, 390 
pathwise, 389, 390 


weak convergence, 540 
weak error criterion, 344 


weighted Monte Carlo, 251-254, 277 
and American option pricing, 253, 
459 
and calibration, 252 
and control variates, 199—200, 254 
and price sensitivities, 418 


yield, 560